Swire Coca-Cola operates extensively, producing, selling, and distributing beverages across 13 states in the American West. The company is known for its regular introduction of new, often limited-time, products aimed at stimulating market demand. However, predicting accurate demand for these innovative products remains a significant challenge. Historical data provides some insight but is insufficient for precise forecasting due to variations in regional and demographic appeal.
The firm stands at the forefront of the Western US beverage distribution sector, continually launching limited edition products to maintain consumer interest and market dominance. Yet, the uncertainty in demand forecasting for these unique items presents risks of either overproduction or shortages, each carrying potential financial and reputational impacts. The project aims to leverage demographic and historical sales data to enhance the accuracy of demand predictions for future innovative offerings.
A precise demand forecast for Swire Coca-Cola’s innovative products is imperative for the company to meet all potential demand without overproducing, thereby maximizing revenue and elevating customer satisfaction. By strategically launching the right products in suitable regions at optimal times, Swire can attain a market advantage, reinforce its position as an industry trendsetter, and foster brand loyalty among increasingly diverse consumers.
Our analytics team is set to use Exploratory Data Analysis (EDA), Time Series Forecasting techniques like ARIMA and SARIMA models, and Machine Learning algorithms to analyze over 24 million observations collected over three years. This vast dataset includes sales data for more than 3100 unique products across 13 western states with varied demographics. Our goal is to identify the optimal launch periods for new products and pinpoint the essential attributes contributing to their success, while also managing production-related costs effectively. We will adhere to standard data science methodologies, including the division of data into training and testing sets to ensure the validity of our model predictions and demand forecasts.
The project presents several analytical challenges. Firstly, we must determine the best approach to integrate external data sources, such as demographic and zip code information, with our primary market demand dataset to enhance regional and demographic-specific demand forecasting. Secondly, the variability in the release durations of historical and future products, ranging from 13 weeks to six months, necessitates tailored analytical approaches to utilize historical data effectively. By addressing these challenges, our team aims to develop robust models that will enable Swire Coca-Cola to accurately forecast demand for innovative products, thereby optimizing inventory levels and enhancing customer satisfaction.
Our success hinges on delivering accurate demand forecasts enabling Swire to balance supply with demand efficiently, ensuring no surplus stock or unmet consumer needs. The effectiveness of our forecasts will be reflected in the company’s profitability, evaluated through our modeling outcomes against industry-standard performance indicators.
We will measure the project’s success through the precision of our demand predictions for innovative products against real sales data. Key performance indicators include the model’s ability to accurately determine the timing, location, duration, and quantity for each new product launch. These metrics aim to provide Swire with valuable insights, leading to financial benefits and an increased market share for their new offerings.
Our analysis will be supported by one critical dataset, essential for the construction of our predictive models, and a tertiery secondary demographic dataset.
The primary dataset is Market Demand, encompassing historical sales data across diverse product categories, brands, caloric segments, and packaging types. This dataset is extensive, with over 24 million records, presenting significant considerations for data handling and analysis.
The secondary dataset comprises demographic details including age, income, gender, household income, and segmentation, coupled with zip code information. This allows for a nuanced approach to forecasting, enabling predictions tailored to specific demographic profiles based on historical sales patterns and census data.
Collectively, these datasets form a comprehensive time series analysis framework, comprising over 24 million observations across 13 features. This includes detailed sales information spanning three years for Swire Coca-Cola products, covering 13 western states and 3100 unique products, segmented into various sales territories, innovation product lines, and brand categories, alongside metrics for seasonal trends, financial impacts, and market dynamics.
The data set contains NA values for CALORIC_SEGMENT in its raw form comprising 0.2% (59725 missing values) of the data. Using text analysis on ITEM description, imputation was performed to determine the observation’s CALORIC_SEGMENT as either diet/light or regular.
At the outset of this EDA endeavor, the project will attempt to concentrate on crafting predictive models and generating business insights for seven specified new products. Deliverables include a PowerPoint presentation highlighting key insights, an annotated report delineating the code and model outputs, and all relevant code posted to GitHub for transparency. Future expansions may encompass tailored models for specific brands, locations, or periods to further refine forecasting accuracy.
The project team comprises four analysts: Ian Donaldson, Michael Tom, Andrew Walton, and Jake Jarrard. The project is slated for completion and presentation to the client on April 11, 2024, with key milestones including exploratory data analysis completion by February 25, model building by March 24, and presentation finalization by April 10. This timeline ensures thorough analysis and model development, culminating in a comprehensive presentation of findings to Swire Coca-Cola. All members of the analytics team contributed equally to the production of this EDA effort and exploration of the data.
The purpose of this EDA notebook is to identify and analyze the Swire Coca-Cola data set looking for hints and clues on guidance of forecasting innovation products in a time series format. Additional feature engineering and data cleaning will be performed to prepare the data for time series forecasting and machine learning models.
By accurately forecasting demand for innovative limited-release beverages, Swire Coca-Cola can make informed decisions to optimize inventory levels, enhance customer satisfaction, and solidify its position as an industry leader in beverage innovation. The collaborative efforts of the analytics team will furnish Swire with the insights needed to navigate the complexities of new product launches successfully. It would be naive to suppose our EDA captured all of the essential elements we would want to know. Its size and complexity will be nearly impossible to fully understand. Ulimately we will need to break the dataset down into useable sub dataframes that will allow for very boutique feature engineering to capture inherent numeric information in each row, by product, flavor, market region, etc. However, this EDA experience in its entirety answers nearly all essential questions beneath the surface of 24 million rows and billions of dollars of soft drink sales in recent years.
We have provided the following EDA in four sections. Summaries of each follow. Some comments are provided in individual sections when necessary. Most comments are captured inside R code output. Most code has been largely suppressed as has lengthy output.
Several insights can be derived from the comprehensive review of the market demand datasheet. This section captures the essentials any data scientist would want to hear up front in terms of describing the data. Highlights as follows…
Upon examination of the DATE column, we observe approximately three years of data. Certain products within the dataset span the entire three years, while others are cataloged for less than six months. Products with shorter durations are likely to be more pertinent to our predictions regarding innovative products.
The analysis of the DATE attribute proves crucial, suggesting the potential utility of segmenting our models based on varying date ranges, such as 13 weeks or less than six months, to align more closely with the specific queries we aim to resolve.
Regarding CATEGORY, while energy drinks generate the highest revenue per unit, Soft Sodas (SSD) emerge as the predominantly sold category. This dynamic warrants closer inspection.
In comparison with other manufacturers, Swire-Coca-Cola (Swire-CC) presents an average performance in terms of total sales revenue and units sold. The concern arises from their lower unit price relative to competitors, leading to higher volume sales required to match the revenue of other manufacturers, which could potentially erode profit margins. Our analysis will thus aim to identify products with higher average unit prices for enhanced profitability.
Between the two caloric segments, diet/light and regular, both command similar prices per unit. However, regular beverages significantly outperform diet/light in both unit sales and total revenue, indicating a stronger market demand for regular sodas.
The primary challenge lies in sifting through the voluminous data to determine the most relevant fields and questions for focus. Identifying these areas will enable us to harness the extensive dataset effectively, feeding into the models we develop for robust analysis.
The sales data analysis of Swire Coca-Cola unveils critical insights. Initially, the dataset had 59,725 missing entries, all addressed through text analysis, achieving a state of zero missing values. The dataset covers 152 unique weeks, providing comprehensive weekly data for nearly three years, showcasing consistency in time-series data across two caloric segments, five categories, eight manufacturers, 319 brands, 103 package types, and 3,692 unique product descriptions.
Delving into product specifics, characteristics emerge, such as Diet Smash belonging to the energy category with three types of packaging. The Sparkling Jacceptabletlester brand offers both diet and regular options, available in diverse packaging, while Venomous Blast provides diet and regular choices, predominantly in “16small multi cup” packages. In manufacturer rankings, Swire holds the third position in sales, with Bubble Joy Advantageous leading in brand sales, followed by Real-Time and Peppy.
Data analysis indicates a left skew in brand run times, displaying a median of 137 weeks against a mean of 99.1 weeks, reflecting varied brand life spans. In contrast, package tenure shows right skewness, with longer tenures for established package sizes, evidenced by a median of 147 weeks and a mean of 117 weeks. However, 14 package sizes exhibit durations of less than six months, potentially signaling new packaging innovations.
Seasonal trends present inconsistencies, with missing weeks and notable fluctuations among all analyzed beverages, prompting further examination to clarify whether these arise from data omissions or natural product lifecycle conclusions. This necessitates meticulous validation of product start and end dates, especially for brands with intermittent data.
In conclusion, the detailed analysis sheds light on significant aspects of Swire Coca-Cola’s sales trends and product dynamics, laying a foundation for enhanced forecasting accuracy and strategic product launch planning.
This segment of the EDA is dedicated to organizing the data into suitable sets for modeling. We aim to identify any data inaccuracies that may lead to incorrect predictions and isolate segments that might introduce noise into the models. The final objective is to discern which attributes of our innovative products currently exist in our dataset, ensuring a clean and relevant data environment for predictive modeling.
During this phase, we undertook a comprehensive review of the datasets intended for model development. Key tasks included identifying multi-category items and filtering out non-essential categories to enhance model clarity. We noted the potential benefit of excluding data on ongoing sales, which could otherwise obscure the models with irrelevant variability. Additionally, our examination of sales tenure distributions by category underscored the need to incorporate these divergences into our modeling strategy effectively.
A crucial part of our evaluation focused on the yearly sales distribution, revealing significant month-to-month and group-to-group variations. Recognizing these temporal differences will be instrumental in shaping our models, especially in predicting seasonal trends and consumer behavior.
Delving into the specifics of modeling our innovation products posed unique challenges, given the sparse data for certain new product launches, including those with novel flavors or categories. Our preliminary analysis highlighted the intricacies and potential obstacles in accurately forecasting demand for these unique items.
Summarizing our findings, the exploration offers critical insights into our data’s complexity and the strategic adjustments needed for our models. As we progress, these insights will guide the enhancement of our modeling techniques, aiming to produce more precise and actionable predictive analytics for our innovative product lines.
We used several tools to reshape the provided demographic data. We transformed and un-pivoted criteria, segment, and count and created a wide format dataframe from the original XLS and preserved all numbers. We conducted simple calculations to capture the aggregate population and households of each zip code and then we used the zip_to_market_unit_mapping.csv to assign each zip code to its appropriate Market Segment. We removed all Zip codes that were not associated with a Market Segment. We renamed the columns as appropriate. Our dataset contains all 2232 Zip codes with their respective demographic data in 83 total columns. The following exploration captures the essential elements of only the demographic data prior to associating it with the main Swire dataset. Despite learning a great deal about the demographic data, it remains to be seen if it can be married properly to the main Swire dataset for use in modeling.
#Quick summary of data inspection and manipulation:
#1: Factored a number of columns
#2: Imputed all NA's (The only NAs were in the CALORIC_SEGMENT column and we used text analytics to impute DIET & REGULAR)
#3: Ensured DATE column data type was convered to a date as.Date()
#4: Created a UNIT_PRICE column: swire_df$UNIT_PRICE <- swire_df$DOLLAR_SALES / swire_df$UNIT_SALES
summary(swire_df)
## DATE MARKET_KEY CALORIC_SEGMENT
## Min. :2020-12-05 1811 : 198609 DIET/LIGHT:12174588
## 1st Qu.:2021-08-14 1172 : 185475 REGULAR :12286836
## Median :2022-04-23 6802 : 159909
## Mean :2022-04-25 602 : 146398
## 3rd Qu.:2022-12-31 1135 : 145933
## Max. :2023-10-28 117 : 141489
## (Other):23483611
## CATEGORY UNIT_SALES DOLLAR_SALES
## COFFEE : 145536 Min. : 0.04 Min. : 0.0
## ENERGY : 5932087 1st Qu.: 11.00 1st Qu.: 36.6
## ING ENHANCED WATER: 2452456 Median : 40.00 Median : 135.1
## SPARKLING WATER : 3019064 Mean : 174.37 Mean : 591.1
## SSD :12912281 3rd Qu.: 126.00 3rd Qu.: 427.1
## Max. :96776.00 Max. :492591.1
##
## MANUFACTURER BRAND
## JOLLYS :6921978 CROWN : 1239010
## SWIRE-CC:5763809 REAL-TIME EDITIONS : 827868
## COCOS :5595540 DIGRESS FLAVORED : 731199
## PONYS :2259095 MYTHICAL BEVERAGE ULTRA: 718536
## BEARS :1593430 BUBBLE JOY : 535030
## ALLYS :1428416 REAL-TIME : 531911
## (Other) : 899156 (Other) :19877870
## PACKAGE ITEM UNIT_PRICE
## 16SMALL MULTI CUP: 3065844 Length:24461424 Min. : 0.00333
## 20SMALL MULTI JUG: 2877015 Class :character 1st Qu.: 2.00429
## 12SMALL 12ONE CUP: 2870763 Mode :character Median : 3.18000
## 2L MULTI JUG : 1896248 Mean : 4.52689
## .5L 6ONE JUG : 1453399 3rd Qu.: 5.34105
## 12SMALL 8ONE CUP : 1408738 Max. :224.99000
## (Other) :10889417
#Outliers:
# UNIT_PRICE: The Max Unit Price seems very high compared to the rest of the observations; no apparent explanation for this as there doesn't seem to be any products in the data set that should sell for that price.
# UNIT_SALES: The Max Unit Sales seems very high compared to the rest of the observations. This could be that they have a certain product that is very popular included in the data set, but it would be worth exploring
brand_ppu_summary <- swire_df %>%
group_by(BRAND) %>%
summarise(n = n(),
avg_price_unit = mean(UNIT_PRICE),
avg_sales_dollars = mean(DOLLAR_SALES),
avg_unit_sales = mean(UNIT_SALES)
) %>%
arrange(desc(avg_price_unit)) %>%
head(10)
brand_ppu_summary
## # A tibble: 10 × 5
## BRAND n avg_price_unit avg_sales_dollars avg_unit_sales
## <fct> <int> <dbl> <dbl> <dbl>
## 1 ALL-OUT ALLIGATOR REA… 11545 35.1 101. 2.32
## 2 KEKE ENERGY ENERGY EX… 263320 15.9 423. 74.5
## 3 KEKE ENERGY ENERGY OR… 207666 13.5 155. 23.9
## 4 CUPADA ARID REMAINING 488 11.5 763. 66.4
## 5 REAL-TIME 531911 11.1 1950. 505.
## 6 MYTHICAL BEVERAGE ABG… 56240 10.9 544. 173.
## 7 MYTHICAL BEVERAGE LO … 142432 10.5 482. 127.
## 8 MEXICAN BUBBLE JOY AD… 85034 7.69 675. 206.
## 9 MYTHICAL BEVERAGE ULT… 718536 7.60 695. 206.
## 10 MYTHICAL BEVERAGE 271932 7.26 1702. 521.
ggplot(brand_ppu_summary, aes(x = BRAND )) +
geom_bar(aes(y = avg_price_unit, fill = BRAND), stat = "identity", position = "dodge") +
labs(title = "Average Price Per Unit by Brand",
x = "Brand",
y = "Price Per Unit") +
scale_fill_manual(values = c("ALL-OUT ALLIGATOR REAL-TIME" = "darkred", "KEKE ENERGY ENERGY EXTRA STRENGTH" = "red", "KEKE ENERGY ENERGY ORIGINAL" = "red", "CUPADA ARID REMAINING" = "grey", "REAL-TIME" = "grey", "MYTHICAL BEVERAGE ABGREENERUTELY ZERO" = "grey", "MYTHICAL BEVERAGE LO CARB" = "grey", "MEXICAN BUBBLE JOY ADVANTAGEOUS" = "grey", "MYTHICAL BEVERAGE ULTRA" = "grey", "MYTHICAL BEVERAGE" = "grey")) +
theme_classic() +
theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
theme(axis.text.y = element_text(angle = 0, hjust = 1))
# Several products have a much higher average unit sales price than the mean ($4.50). This includes one product with and average price of $35.00 and then another tier of products with average prices between $10-$15. These products may not need to sell as many units to be able to reach a high number of total sales dollars due to the high prices associated with the products.
rm(brand_ppu_summary)
brand_summary <- swire_df %>%
group_by(BRAND) %>%
summarise(n = n(),
total_sales_dollars = sum(DOLLAR_SALES),
total_unit_sales = sum(UNIT_SALES)
) %>%
arrange(desc(total_sales_dollars)) %>%
head(10)
brand_summary
## # A tibble: 10 × 4
## BRAND n total_sales_dollars total_unit_sales
## <fct> <int> <dbl> <dbl>
## 1 BUBBLE JOY ADVANTAGEOUS 443625 1604866764. 404310942.
## 2 REAL-TIME 531911 1037072280. 268351647.
## 3 PEPPY 399458 937030445. 251184816.
## 4 PAPI 454567 816413505. 216457130.
## 5 DIET BUBBLE JOY ADVANTAGEOUS 369807 653091204. 168687013.
## 6 ELF BUBBLES 402084 649144309. 177671835.
## 7 HILL MOISTURE 505286 557565468. 167352606.
## 8 REAL-TIME EDITIONS 827868 514215183. 172718827.
## 9 MYTHICAL BEVERAGE ULTRA 718536 499397532. 147757133.
## 10 MYTHICAL BEVERAGE 271932 462803408. 141581921.
ggplot(brand_summary, aes(x = BRAND )) +
geom_bar(aes(y = total_sales_dollars, fill = BRAND), stat = "identity", position = "dodge") +
labs(title = "Total Dollar Sales by Brand",
x = "Brand",
y = "Total Dollar Sales") +
scale_fill_manual(values = c("BUBBLE JOY ADVANTAGEOUS" = "darkred", "REAL-TIME" = "red", "PEPPY" = "red", "PAPI" = "red", "DIET BUBBLE JOY ADVANTAGEOUS" = "grey", "ELF BUBBLES" = "grey", "HILL MOISTURE" = "grey", "REAL-TIME EDITIONS" = "grey", "MYTHICAL BEVERAGE ULTRA" = "grey", "MYTHICAL BEVERAGE" = "grey")) +
theme_classic() +
theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
theme(axis.text.y = element_text(angle = 0, hjust = 1))
# A handful of brands exceed $1B in total sales dollars, and several more above $500M. We are obviously dealing with very large and well known brands in this data set.
rm(brand_summary)
man_avg_summary <- swire_df %>%
group_by(MANUFACTURER) %>%
summarise(n = n(),
avg_price_unit = mean(UNIT_PRICE),
avg_sales_dollars = mean(DOLLAR_SALES),
avg_unit_sales = mean(UNIT_SALES)
) %>%
arrange(desc(avg_sales_dollars))
man_avg_summary
## # A tibble: 8 × 5
## MANUFACTURER n avg_price_unit avg_sales_dollars avg_unit_sales
## <fct> <int> <dbl> <dbl> <dbl>
## 1 ALLYS 1428416 8.26 1094. 312.
## 2 COCOS 5595540 3.92 816. 229.
## 3 PONYS 2259095 6.19 656. 210.
## 4 SWIRE-CC 5763809 3.76 501. 145.
## 5 JORDYS 428170 5.45 489. 188.
## 6 JOLLYS 6921978 3.82 477. 151.
## 7 KEKES 470986 14.8 305. 52.2
## 8 BEARS 1593430 3.50 196. 51.4
ggplot(man_avg_summary, aes(x = MANUFACTURER )) +
geom_bar(aes(y = avg_sales_dollars, fill = MANUFACTURER), stat = "identity", position = "dodge") +
labs(title = "Average Dollar Sales by Manufacturer",
x = "Manufacturer",
y = "Average Dollar Sales") +
scale_fill_manual(values = c("SWIRE-CC" = "red", "JOLLYS" = "grey", "COCOS" = "grey", "ALLYS" = "grey", "PONYS" = "grey", "BEARS" = "grey", "JORDYS" = "grey", "KEKES" = "grey")) +
theme_classic() +
theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
theme(axis.text.y = element_text(angle = 0, hjust = 1))
# Swire-CC is 4th in average sales when compared to other manufacturers in the data set. Their average sales are only a little more than half of the average sales of the ALLYS. Although, they rank #2 in the number of observations in the data behind Jollys. There are numerous inferences that can be made from the table above.
man_ts_summary <- swire_df %>%
group_by(MANUFACTURER) %>%
summarise(n = n(),
total_sales_dollars = sum(DOLLAR_SALES),
total_unit_sales = sum(UNIT_SALES)
) %>%
arrange(desc(total_sales_dollars))
man_ts_summary
## # A tibble: 8 × 4
## MANUFACTURER n total_sales_dollars total_unit_sales
## <fct> <int> <dbl> <dbl>
## 1 COCOS 5595540 4563306476. 1279323647.
## 2 JOLLYS 6921978 3301641671. 1043704245.
## 3 SWIRE-CC 5763809 2885435787. 834941867.
## 4 ALLYS 1428416 1562675378. 445086050.
## 5 PONYS 2259095 1481611289. 475149320.
## 6 BEARS 1593430 312718094. 81974082.
## 7 JORDYS 428170 209238104. 80589148.
## 8 KEKES 470986 143501825. 24594243.
ggplot(man_ts_summary, aes(x = MANUFACTURER )) +
geom_bar(aes(y = total_sales_dollars, fill = MANUFACTURER), stat = "identity", position = "dodge") +
labs(title = "Total Dollar Sales by Manufacturer",
x = "Manufacturer",
y = "Total Dollar Sales") +
scale_fill_manual(values = c("SWIRE-CC" = "red", "JOLLYS" = "grey", "COCOS" = "grey", "ALLYS" = "grey", "PONYS" = "grey", "BEARS" = "grey", "JORDYS" = "grey", "KEKES" = "grey")) +
theme_classic() +
theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
theme(axis.text.y = element_text(angle = 0, hjust = 1))
# Swire-CC is 3rd total revenues when compared to other manufacturers. COCOS has nearly twice as many total sales as SWIRE-CC, though COCO's also has higher priced packaged items than Swire.
man_up_summary<- swire_df %>%
group_by(MANUFACTURER) %>%
summarise(n = n(),
total_sales_dollars = sum(DOLLAR_SALES),
average_unit_price = mean(UNIT_PRICE)
) %>%
arrange(desc(average_unit_price))
man_up_summary
## # A tibble: 8 × 4
## MANUFACTURER n total_sales_dollars average_unit_price
## <fct> <int> <dbl> <dbl>
## 1 KEKES 470986 143501825. 14.8
## 2 ALLYS 1428416 1562675378. 8.26
## 3 PONYS 2259095 1481611289. 6.19
## 4 JORDYS 428170 209238104. 5.45
## 5 COCOS 5595540 4563306476. 3.92
## 6 JOLLYS 6921978 3301641671. 3.82
## 7 SWIRE-CC 5763809 2885435787. 3.76
## 8 BEARS 1593430 312718094. 3.50
ggplot(man_up_summary, aes(x = MANUFACTURER )) +
geom_bar(aes(y = average_unit_price, fill = MANUFACTURER), stat = "identity", position = "dodge") +
labs(title = "Average Unit Price by Manufacturer",
x = "Manufacturer",
y = "Average Unit Price") +
scale_fill_manual(values = c("SWIRE-CC" = "red", "JOLLYS" = "grey", "COCOS" = "grey", "ALLYS" = "grey", "PONYS" = "grey", "BEARS" = "grey", "JORDYS" = "grey", "KEKES" = "grey")) +
theme_classic() +
theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
theme(axis.text.y = element_text(angle = 0, hjust = 1))
# The average unit price has a large range when grouped by manufacturers, and SWIRE-CC is near the bottom of this list; since we know that Coca Cola is the largest soft drink manufacturer in the world, it isn't a surprise that the overall price per ounce of soda is likely less than its competitors.
#KEKES, ALLYS, and PONYS all do significantly less total dollar sales and total unit sales than SWIRE-CC, but all 3 are selling their units at a much higher price; this may prove to be a point worth further discovery as smaller companies that produce in smaller quantities may prove to be a good study for short term product production.
rm(man_avg_summary)
rm(man_up_summary)
rm(man_ts_summary)
man_cs_summary <- swire_df %>%
group_by(CALORIC_SEGMENT) %>%
summarise(n = n(),
avg_price_unit = mean(UNIT_PRICE),
avg_sales_dollars = mean(DOLLAR_SALES),
avg_unit_sales = mean(UNIT_SALES),
total_sales_dollars = sum(DOLLAR_SALES)
) %>%
arrange(desc(avg_sales_dollars))
man_cs_summary
## # A tibble: 2 × 6
## CALORIC_SEGMENT n avg_price_unit avg_sales_dollars avg_unit_sales
## <fct> <int> <dbl> <dbl> <dbl>
## 1 REGULAR 12286836 4.29 739. 220.
## 2 DIET/LIGHT 12174588 4.77 442. 128.
## # ℹ 1 more variable: total_sales_dollars <dbl>
swire_df %>%
group_by(CALORIC_SEGMENT) %>%
summarise(n = n(),
total_sales_dollars = sum(DOLLAR_SALES),
total_unit_sales = sum(UNIT_SALES)
) %>%
arrange(desc(total_sales_dollars))
## # A tibble: 2 × 4
## CALORIC_SEGMENT n total_sales_dollars total_unit_sales
## <fct> <int> <dbl> <dbl>
## 1 REGULAR 12286836 9081956949. 2702820349.
## 2 DIET/LIGHT 12174588 5378171675. 1562542253.
ggplot(man_cs_summary, aes(x = CALORIC_SEGMENT )) +
geom_bar(aes(y = total_sales_dollars, fill = CALORIC_SEGMENT), stat = "identity", position = "dodge") +
labs(title = "Caloric Segment by Total Sales Dollars",
x = "Caloric Segment",
y = "Total Sales Dollars") +
theme_classic() +
scale_fill_manual(values = c( "DIET/LIGHT" = "grey", "REGULAR" = "red")) +
theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
theme(axis.text.y = element_text(angle = 0, hjust = 1))
# There is a large difference in total unit sales and total dollar sales between regular soft drinks and diet/light. Regular drinks show about double the unit sales and dollar sales even though total observations are split nearly down the middle 50/50 between regular and diet.
rm(man_cs_summary)
cat_summary <- swire_df %>%
group_by(CATEGORY) %>%
summarise(n = n(),
avg_price_unit = mean(UNIT_PRICE),
avg_sales_dollars = mean(DOLLAR_SALES),
avg_unit_sales = mean(UNIT_SALES),
total_sales_dollars = sum(DOLLAR_SALES)
) %>%
arrange(desc(avg_sales_dollars))
cat_summary
## # A tibble: 5 × 6
## CATEGORY n avg_price_unit avg_sales_dollars avg_unit_sales
## <fct> <int> <dbl> <dbl> <dbl>
## 1 SSD 12912281 3.91 744. 209.
## 2 ENERGY 5932087 6.60 627. 202.
## 3 SPARKLING WATER 3019064 3.74 208. 55.0
## 4 ING ENHANCED WATER 2452456 3.80 202. 80.4
## 5 COFFEE 145536 2.91 99.0 45.1
## # ℹ 1 more variable: total_sales_dollars <dbl>
swire_df %>%
group_by(CATEGORY) %>%
summarise(n = n(),
total_sales_dollars = sum(DOLLAR_SALES),
total_unit_sales = sum(UNIT_SALES)
) %>%
arrange(desc(total_sales_dollars))
## # A tibble: 5 × 4
## CATEGORY n total_sales_dollars total_unit_sales
## <fct> <int> <dbl> <dbl>
## 1 SSD 12912281 9606514926. 2696996876.
## 2 ENERGY 5932087 3718445029. 1198324442.
## 3 SPARKLING WATER 3019064 626470020. 166177968.
## 4 ING ENHANCED WATER 2452456 494293460. 197299015.
## 5 COFFEE 145536 14405189. 6564301
ggplot(cat_summary, aes(x = CATEGORY )) +
geom_bar(aes(y = total_sales_dollars, fill = CATEGORY), stat = "identity", position = "dodge") +
labs(title = "Total Sales Dollars by Category",
x = "Category",
y = "Total Sales Dollars") +
theme_classic() +
scale_fill_manual(values = c( "SPARKLING WATER" = "grey", "SSD" = "red", "ING ENHANCED WATER" = "grey", "COFFEE" = "grey", "ENERGY" = "grey")) +
theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
theme(axis.text.y = element_text(angle = 0, hjust = 1))
# SSD accounts for the vast majority of total sales in both units and dollars. This is likely to be interepreted as standard soda, both diet and regular - the classics we know and love. Swire-CC generates most of their revenue from this category. Though the "Energy Drink" category may prove to be useful to use in modeling scenarios, it may prove more useful for our team to focus on SSD products when we consider how Swire-CC would launch innovation products.
rm(cat_summary)
cat_up_summary <- swire_df %>%
group_by(CATEGORY) %>%
summarise(n = n(),
avg_price_unit = mean(UNIT_PRICE),
avg_sales_dollars = mean(DOLLAR_SALES),
avg_unit_sales = mean(UNIT_SALES)
) %>%
arrange(desc(avg_price_unit))
cat_up_summary
## # A tibble: 5 × 5
## CATEGORY n avg_price_unit avg_sales_dollars avg_unit_sales
## <fct> <int> <dbl> <dbl> <dbl>
## 1 ENERGY 5932087 6.60 627. 202.
## 2 SSD 12912281 3.91 744. 209.
## 3 ING ENHANCED WATER 2452456 3.80 202. 80.4
## 4 SPARKLING WATER 3019064 3.74 208. 55.0
## 5 COFFEE 145536 2.91 99.0 45.1
ggplot(cat_up_summary, aes(x = CATEGORY )) +
geom_bar(aes(y = avg_price_unit, fill = CATEGORY), stat = "identity", position = "dodge") +
labs(title = "Average Price per Unit by Category",
x = "Category",
y = "Average Price Per Unit") +
theme_classic() +
scale_fill_manual(values = c( "SPARKLING WATER" = "grey", "SSD" = "grey", "ING ENHANCED WATER" = "grey", "COFFEE" = "grey", "ENERGY" = "red")) +
theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
theme(axis.text.y = element_text(angle = 0, hjust = 1))
# Energy drinks sell at a much higher price per unit than the other categories. This means that assuming their cost to produce is not much different from the other caloric segments, producing more energy drinks would be an easy way to increase margins and could prove useful in comparing and contrasting innovation products against current sellers in this emerging category. It is also no surprise that the typical energy drink costs 2x at most locations than standard bottled soda products.
rm(cat_up_summary)
# All Brand and Package combinations
swire_df %>%
group_by(BRAND, PACKAGE) %>%
summarise(MIN_DATE = min(DATE),
MAX_DATE = max(DATE),
DURATION = MAX_DATE - MIN_DATE) %>%
arrange(desc(DURATION))
## `summarise()` has grouped output by 'BRAND'. You can override using the
## `.groups` argument.
## # A tibble: 1,720 × 5
## # Groups: BRAND [319]
## BRAND PACKAGE MIN_DATE MAX_DATE DURATION
## <fct> <fct> <date> <date> <drtn>
## 1 ABCS 12SMALL 24ONE PLA… 2020-12-05 2023-10-28 1057 da…
## 2 ABCS 12SMALL 4ONE PLAS… 2020-12-05 2023-10-28 1057 da…
## 3 ABCS 12SMALL MLT PLAST… 2020-12-05 2023-10-28 1057 da…
## 4 ALL-OUT ALLIGATOR REAL-TIME 12SMALL 24ONE CUP 2020-12-05 2023-10-28 1057 da…
## 5 ANCESTOR PEPPY 12SMALL 12ONE CUP 2020-12-05 2023-10-28 1057 da…
## 6 ANCIENT ELIXIR 12SMALL 6ONE CUP 2020-12-05 2023-10-28 1057 da…
## 7 ANCIENT ELIXIR 2L MULTI JUG 2020-12-05 2023-10-28 1057 da…
## 8 AZURE HORIZON 12SMALL 6ONE CUP 2020-12-05 2023-10-28 1057 da…
## 9 BEAUTIFUL GREENER .5L 6ONE JUG 2020-12-05 2023-10-28 1057 da…
## 10 BEAUTIFUL GREENER 12SMALL 12ONE CUP 2020-12-05 2023-10-28 1057 da…
## # ℹ 1,710 more rows
# launch DATE, end DATE, and DURATION of brand or package less than 6 months, or 26 weeks, in duration
swire_df %>%
group_by(BRAND, PACKAGE) %>%
summarise(MIN_DATE = min(DATE),
MAX_DATE = max(DATE),
DURATION = MAX_DATE - MIN_DATE) %>%
filter(DURATION < 180) %>%
filter(DURATION > 30) %>%
arrange(desc(DURATION))
## `summarise()` has grouped output by 'BRAND'. You can override using the
## `.groups` argument.
## # A tibble: 95 × 5
## # Groups: BRAND [78]
## BRAND PACKAGE MIN_DATE MAX_DATE DURATION
## <fct> <fct> <date> <date> <drtn>
## 1 CROWN ALL OTHER ONES 2023-05-06 2023-10-28 175 days
## 2 DIET BUBBLE JOY ADVANTAGEOUS 7.5SMALL 4ONE CUP 2023-05-06 2023-10-28 175 days
## 3 DIET ELF BUBBLES ZERO 7.5SMALL 4ONE CUP 2023-05-06 2023-10-28 175 days
## 4 ELF BUBBLES LYMONADE LEGACY 20SMALL MULTI JUG 2023-05-06 2023-10-28 175 days
## 5 FANTASMIC 7.5SMALL 4ONE CUP 2023-05-06 2023-10-28 175 days
## 6 FRESH GRAPEFRUIT 7.5SMALL 4ONE CUP 2023-05-06 2023-10-28 175 days
## 7 KOOL! ZERO SUGAR 12SMALL 8ONE CUP 2022-04-16 2022-10-08 175 days
## 8 KOOL! ZERO SUGAR 7.5SMALL 4ONE CUP 2023-05-06 2023-10-28 175 days
## 9 SUPER-DUPER PUNCHED 12SMALL 6ONE CUP 2023-05-06 2023-10-28 175 days
## 10 ASTRAL BEVERAGE .5L 12ONE JUG 2023-05-13 2023-10-28 168 days
## # ℹ 85 more rows
# As an initial foray into determining the characteristics of "Innovation Products" some of the data surrounding launch and packaging caught our attention. If we were to define innovation products as being in the market for at least 30 days but less than 6 months, we see there are 95 different brand packaging combinations that fit this criteria in our historic data set out of a total of 1720 Package and Brand Combinations. Does packaging matter for an innovation product?
swire_df %>%
group_by(BRAND, PACKAGE) %>%
summarise(n = n(),
MIN_DATE = min(DATE),
MAX_DATE = max(DATE),
DURATION = MAX_DATE - MIN_DATE) %>%
filter(DURATION < 100 ) %>%
filter(DURATION > 81) %>%
arrange(desc(DURATION))
## `summarise()` has grouped output by 'BRAND'. You can override using the
## `.groups` argument.
## # A tibble: 15 × 6
## # Groups: BRAND [15]
## BRAND PACKAGE n MIN_DATE MAX_DATE DURATION
## <fct> <fct> <int> <date> <date> <drtn>
## 1 HILL MOISTURE ZERO SUGAR COASTA… 12SMAL… 12 2023-07-15 2023-10-21 98 days
## 2 KOOL! ZERO SUGAR MOON 7.5SMA… 11 2022-02-26 2022-06-04 98 days
## 3 PAPI ZERO SUGAR 12SMAL… 6 2022-06-25 2022-10-01 98 days
## 4 RADIANT'S 12SMAL… 13 2022-01-22 2022-04-30 98 days
## 5 DIET HANSENIZZLE'S 12SMAL… 4 2022-11-26 2023-02-25 91 days
## 6 HILL MOISTURE ZERO SUGAR 16SMAL… 4 2021-01-09 2021-04-10 91 days
## 7 DIET BUBBLE JOY ADVANTAGEOUS 12SMAL… 4 2022-10-08 2022-12-31 84 days
## 8 DIET PEPPY 12SMAL… 3 2022-03-26 2022-06-18 84 days
## 9 ELF BUBBLES 8.5SMA… 2 2021-06-12 2021-09-04 84 days
## 10 ELF BUBBLES SUMMER SPICED MIXED… 12SMAL… 18 2022-10-29 2023-01-21 84 days
## 11 KOOL! ZERO SUGAR SHREK 7.5SMA… 3 2022-08-20 2022-11-12 84 days
## 12 MYTHICAL BEVERAGE REHAB 16SMAL… 10 2020-12-12 2021-03-06 84 days
## 13 PAPI WILD CHERRY 7.5SMA… 323 2023-08-05 2023-10-28 84 days
## 14 PEPPY ZERO SUGAR 12SMAL… 16 2023-08-05 2023-10-28 84 days
## 15 PLEASURE & LIGHT 12SMAL… 3 2021-07-17 2021-10-09 84 days
# 3 of the 7 questions Swire raised were related to 13 week launches. 13*7 = 91 days, ergo this table presents a glimpse at products on the market between 80-100 days and also released around the same time frames as Swire-CC anticipates releasing a handfull of Innovation Products.
# Load the data
#df <- readRDS("swire_no_nas.rds")
#write to csv for other tools
#write.csv(df, "swire_no_nas.csv")
# Team added a few more columns and features to the dataset for a deeper dive.
#season column based on date
#Months broken up into seasons for potential feature engineering uses, along with sales per unit.
df <- swire_df
rm(swire_df)
names(df)[names(df) == "UNIT_PRICE"] <- "SINGLE_PRICE"
df <- df %>%
mutate(MONTH = month(ymd(df$DATE)), # Extract month from the date
SEASON = case_when(
MONTH %in% c(12, 1, 2) ~ "WINTER",
MONTH %in% c(3, 4, 5) ~ "SPRING",
MONTH %in% c(6, 7, 8) ~ "SUMMER",
MONTH %in% c(9, 10, 11) ~ "FALL",
TRUE ~ NA_character_
))
#convert date to date type
df$DATE <- as.Date(df$DATE)
#factorize
df <- df %>%
mutate(BRAND = as.character(BRAND),
PACKAGE = as.character(PACKAGE)) %>%
mutate(across(c(BRAND, PACKAGE, CATEGORY, MANUFACTURER, SEASON), ~as.factor(.)))
#one hot encode CALORIC_SEGMENT as 0 or 1
df <- df %>%
mutate(across(CALORIC_SEGMENT, ~ifelse(. == "REGULAR", 1, 0)))
# Print the result
# validate data types
# After feature data type conversion, the data is ready for exploratory data analysis.
str(df)
## 'data.frame': 24461424 obs. of 13 variables:
## $ DATE : Date, format: "2021-08-21" "2022-05-07" ...
## $ MARKET_KEY : Factor w/ 200 levels "1","6","7","13",..: 98 119 117 47 45 59 48 93 138 165 ...
## $ CALORIC_SEGMENT: num 0 1 0 1 1 1 0 0 0 1 ...
## $ CATEGORY : Factor w/ 5 levels "COFFEE","ENERGY",..: 5 5 3 5 5 5 4 5 5 5 ...
## $ UNIT_SALES : num 69 4 1 3 4 112 21 3 19 57 ...
## $ DOLLAR_SALES : num 389.74 30.96 2.25 7.55 25.96 ...
## $ MANUFACTURER : Factor w/ 8 levels "ALLYS","BEARS",..: 8 3 4 3 3 8 4 4 4 4 ...
## $ BRAND : Factor w/ 319 levels "ABCS","ALL-OUT ALLIGATOR REAL-TIME",..: 93 131 96 33 257 262 11 164 241 151 ...
## $ PACKAGE : Factor w/ 103 levels ".5L 12ONE JUG",..: 14 14 69 62 14 79 38 4 14 4 ...
## $ ITEM : chr "YAWN ZERO SUGAR GENTLE DRINK SUPER-JUICE DURIAN CUP 12 LIQUID SMALL X12" "GORGEOUS SUNSET OUS GENTLE DRINK AVOCADO CUP 12 LIQUID SMALL X12" "DIGRESS ZERO NUTRIENT ENHANCED WATER BVRG PURPLE ZERO CALORIE JUG 20 LIQUID SMALL" "KOOL! RED GENTLE DRINK RED COLA CONTOUR JUG 33.8 LIQUID SMALL" ...
## $ SINGLE_PRICE : num 5.65 7.74 2.25 2.52 6.49 ...
## $ MONTH : int 8 5 10 8 1 11 3 11 7 4 ...
## $ SEASON : Factor w/ 4 levels "FALL","SPRING",..: 3 2 1 3 4 1 2 1 3 2 ...
#echo BRAND name and spacing
cat("\n DIET SMASH \n")
##
## DIET SMASH
#BRAND == DIET SMASH
df %>%
filter(BRAND == "DIET SMASH") %>%
select(MANUFACTURER,CALORIC_SEGMENT, CATEGORY, PACKAGE, ) %>%
summary()
## MANUFACTURER CALORIC_SEGMENT CATEGORY
## SWIRE-CC:17483 Min. :0 COFFEE : 0
## ALLYS : 0 1st Qu.:0 ENERGY : 0
## BEARS : 0 Median :0 ING ENHANCED WATER: 0
## COCOS : 0 Mean :0 SPARKLING WATER : 0
## JOLLYS : 0 3rd Qu.:0 SSD :17483
## JORDYS : 0 Max. :0
## (Other) : 0
## PACKAGE
## 12SMALL 12ONE CUP:11849
## 2L MULTI JUG : 5630
## 12SMALL 6ONE CUP : 4
## .5L 12ONE JUG : 0
## .5L 24ONE JUG : 0
## .5L 4ONE JUG : 0
## (Other) : 0
#echo BRAND name and spacing
cat("\n SPARKLING JACCEPTABLETLESTER \n")
##
## SPARKLING JACCEPTABLETLESTER
#BRAND == JACCEPTABLETESTER
df %>%
filter(BRAND == "SPARKLING JACCEPTABLETLESTER") %>%
select(MANUFACTURER,CALORIC_SEGMENT, CATEGORY, PACKAGE) %>%
summary()
## MANUFACTURER CALORIC_SEGMENT CATEGORY
## SWIRE-CC:299697 Min. :0.0000 COFFEE : 0
## ALLYS : 0 1st Qu.:0.0000 ENERGY : 0
## BEARS : 0 Median :1.0000 ING ENHANCED WATER: 0
## COCOS : 0 Mean :0.7275 SPARKLING WATER : 81682
## JOLLYS : 0 3rd Qu.:1.0000 SSD :218015
## JORDYS : 0 Max. :1.0000
## (Other) : 0
## PACKAGE
## 1L MULTI JUG :65127
## ALL OTHER ONES :52678
## 7.5SMALL 6ONE CUP :50855
## 10SMALL 6ONE PLASTICS JUG:35845
## 2L MULTI JUG :27326
## 20SMALL MULTI JUG :27106
## (Other) :40760
#echo BRAND name and spacing
cat("\n VENOMOUS BLAST \n")
##
## VENOMOUS BLAST
#BRAND == VENOMOUS BLAST
df %>%
filter(BRAND == "VENOMOUS BLAST") %>%
select(MANUFACTURER,CALORIC_SEGMENT, CATEGORY, PACKAGE) %>%
summary()
## MANUFACTURER CALORIC_SEGMENT CATEGORY
## SWIRE-CC:51756 Min. :0.0000 COFFEE : 0
## ALLYS : 0 1st Qu.:0.0000 ENERGY :51756
## BEARS : 0 Median :1.0000 ING ENHANCED WATER: 0
## COCOS : 0 Mean :0.7449 SPARKLING WATER : 0
## JOLLYS : 0 3rd Qu.:1.0000 SSD : 0
## JORDYS : 0 Max. :1.0000
## (Other) : 0
## PACKAGE
## 16SMALL MULTI CUP:51728
## 8SMALL MULTI CUP : 19
## 16SMALL MULTI JUG: 9
## .5L 12ONE JUG : 0
## .5L 24ONE JUG : 0
## .5L 4ONE JUG : 0
## (Other) : 0
#echo BRAND name and spacing
cat("\n SQUARE \n")
##
## SQUARE
#BRAND == SQUARE
df %>%
filter(BRAND == "SQUARE") %>%
select(MANUFACTURER,CALORIC_SEGMENT, CATEGORY, PACKAGE) %>%
summary()
## MANUFACTURER CALORIC_SEGMENT CATEGORY
## SWIRE-CC:7017 Min. :0.0000 COFFEE : 0
## ALLYS : 0 1st Qu.:1.0000 ENERGY : 0
## BEARS : 0 Median :1.0000 ING ENHANCED WATER: 0
## COCOS : 0 Mean :0.7881 SPARKLING WATER :7015
## JOLLYS : 0 3rd Qu.:1.0000 SSD : 2
## JORDYS : 0 Max. :1.0000
## (Other) : 0
## PACKAGE
## 20SMALL MULTI JUG :6641
## ALL OTHER ONES : 347
## 2L MULTI JUG : 27
## .5L MLT SHADYES JUG: 1
## 1.5L MULTI JUG : 1
## .5L 12ONE JUG : 0
## (Other) : 0
#echo BRAND name and spacing
cat("\n GREETINGLE \n")
##
## GREETINGLE
#BRAND == GREETINGLE
df %>%
filter(BRAND == "GREETINGLE") %>%
select(MANUFACTURER,CALORIC_SEGMENT, CATEGORY, PACKAGE) %>%
summary()
## MANUFACTURER CALORIC_SEGMENT CATEGORY
## SWIRE-CC:491300 Min. :0 COFFEE : 0
## ALLYS : 0 1st Qu.:0 ENERGY : 0
## BEARS : 0 Median :0 ING ENHANCED WATER:491300
## COCOS : 0 Mean :0 SPARKLING WATER : 0
## JOLLYS : 0 3rd Qu.:0 SSD : 0
## JORDYS : 0 Max. :0
## (Other) : 0
## PACKAGE
## 18SMALL MULTI JUG:373131
## 18SMALL 6ONE : 86750
## .5L 6ONE JUG : 23207
## ALL OTHER ONES : 7845
## .5L 12ONE JUG : 367
## .5L 24ONE JUG : 0
## (Other) : 0
#echo BRAND name and spacing
cat("\n DIET MOONLIT \n")
##
## DIET MOONLIT
#BRAND == DIET MOONLIT
df %>%
filter(BRAND == "DIET MOONLIT") %>%
select(MANUFACTURER,CALORIC_SEGMENT, CATEGORY, PACKAGE) %>%
summary()
## MANUFACTURER CALORIC_SEGMENT CATEGORY
## SWIRE-CC:75948 Min. :0 COFFEE : 0
## ALLYS : 0 1st Qu.:0 ENERGY : 0
## BEARS : 0 Median :0 ING ENHANCED WATER: 0
## COCOS : 0 Mean :0 SPARKLING WATER : 0
## JOLLYS : 0 3rd Qu.:0 SSD :75948
## JORDYS : 0 Max. :0
## (Other) : 0
## PACKAGE
## 2L MULTI JUG :28140
## 12SMALL 12ONE CUP:28006
## .5L 6ONE JUG :12023
## 20SMALL MULTI JUG: 7684
## 12SMALL 6ONE CUP : 94
## ALL OTHER ONES : 1
## (Other) : 0
#echo BRAND name and spacing
cat("\n PEPPY \n")
##
## PEPPY
#BRAND == PEPPY
df %>%
filter(BRAND == "PEPPY") %>%
select(MANUFACTURER,CALORIC_SEGMENT, CATEGORY, PACKAGE) %>%
summary()
## MANUFACTURER CALORIC_SEGMENT CATEGORY
## SWIRE-CC:399458 Min. :1 COFFEE : 0
## ALLYS : 0 1st Qu.:1 ENERGY : 0
## BEARS : 0 Median :1 ING ENHANCED WATER: 0
## COCOS : 0 Mean :1 SPARKLING WATER : 0
## JOLLYS : 0 3rd Qu.:1 SSD :399458
## JORDYS : 0 Max. :1
## (Other) : 0
## PACKAGE
## 20SMALL MULTI JUG: 34546
## 2L MULTI JUG : 29397
## .5L 6ONE JUG : 29396
## 12SMALL 12ONE CUP: 29396
## 7.5SMALL 6ONE CUP: 29389
## 1L MULTI JUG : 29300
## (Other) :218034
Comments: The Diet Smash product is a diet product, in the energy category, and comes in 3 packaging types (Innovation - Packaging?). The Sparkling Jacceptabletlester brand comes in both diet/regular, straddles both sparkling water and sparkling soda drink categories, and comes in multiple different package types (but not innovation package “16small multi cup”). The Venomous Blast product comes in both diet and regular, in the energy category product, and comes in 3 packaging types (almost exclusively “16small multi cup”, with two short term release sizes, but not future “innovation package?” “16 liquid small”). The Square brand comes in both diet and regular, in the 2 categories (sparkling water and ssd), and comes in 5 packaging types (several have extremely small counts - full innovation packaging?). The Greetingle brand comes in only diet, in the ING Enhanced Water category, and comes in 6 packaging types (all with relatively legit numbers, with the exception of one size). The Diet Moonlit brand comes in only diet, in the ssd category, and comes in multiple packaging types (all with legit numbers). The Peppy brand comes in only regular, in the ssd category, and comes in at least 6 packaging types (all with legit numbers, other could be explored more).
#sales in thousands by manufacturer
df %>%
group_by(MANUFACTURER) %>%
summarise(TOTAL_SALES = sum(DOLLAR_SALES)) %>%
arrange(desc(TOTAL_SALES))
## # A tibble: 8 × 2
## MANUFACTURER TOTAL_SALES
## <fct> <dbl>
## 1 COCOS 4563306476.
## 2 JOLLYS 3301641671.
## 3 SWIRE-CC 2885435787.
## 4 ALLYS 1562675378.
## 5 PONYS 1481611289.
## 6 BEARS 312718094.
## 7 JORDYS 209238104.
## 8 KEKES 143501825.
#graph sales in thousands by manufacturer
df %>%
group_by(MANUFACTURER) %>%
summarise(TOTAL_SALES = sum(DOLLAR_SALES/1000)) %>%
ggplot(aes(x = reorder(MANUFACTURER, -TOTAL_SALES), y = TOTAL_SALES)) +
geom_col() +
coord_flip() +
labs(title = "Total Sales in Thousands by Manufacturer",
x = "Manufacturer",
y = "$ Sales")
#sales in thousands by top 10 package size
df %>%
group_by(PACKAGE) %>%
summarise(TOTAL_SALES = sum(DOLLAR_SALES/1000)) %>%
arrange(desc(TOTAL_SALES)) %>%
head(10)
## # A tibble: 10 × 2
## PACKAGE TOTAL_SALES
## <fct> <dbl>
## 1 12SMALL 12ONE CUP 3422060.
## 2 20SMALL MULTI JUG 1814841.
## 3 16SMALL MULTI CUP 1621832.
## 4 .5L 6ONE JUG 1089477.
## 5 12SMALL 24ONE CUP 1007780.
## 6 2L MULTI JUG 830755.
## 7 12SMALL MULTI CUP 758168.
## 8 12SMALL 8ONE CUP 422976.
## 9 ALL OTHER ONES 291378.
## 10 12SMALL 8ONE SHADYES JUG 274115.
#graph sales in thousands by top 10 ten package size
df %>%
group_by(PACKAGE) %>%
summarise(TOTAL_SALES = sum(DOLLAR_SALES/1000)) %>%
arrange(desc(TOTAL_SALES)) %>%
head(10) %>%
ggplot(aes(x = reorder(PACKAGE, -TOTAL_SALES), y = TOTAL_SALES)) +
geom_col() +
coord_flip() +
labs(title = "Top 10 Sales in Thousands by Package Size",
x = "Package Size",
y = "$ Sales")
#bottom 10 sales in throusands by package size
df %>%
group_by(PACKAGE) %>%
summarise(TOTAL_SALES = sum(DOLLAR_SALES)) %>%
arrange(TOTAL_SALES) %>%
head(10)
## # A tibble: 10 × 2
## PACKAGE TOTAL_SALES
## <fct> <dbl>
## 1 8.55SMALL MLT SHADYES JUG 2.1
## 2 20SMALL 15ONE JUG 6.49
## 3 8.55SMALL 6ONE SHADYES JUG 8
## 4 24SMALL 4ONE JUG 9.94
## 5 .5L 4ONE JUG 17.3
## 6 1L 12ONE JUG 18.5
## 7 12SMALL 8ONE JUG 27.3
## 8 20SMALL 12ONE SHADYES JUG 30.0
## 9 12SMALL 32ONE CUP 35.5
## 10 16SMALL MULTI JUG 36.7
#graph sales in thousands by bottom 10 package size
df %>%
group_by(PACKAGE) %>%
summarise(TOTAL_SALES = sum(DOLLAR_SALES)) %>%
arrange(TOTAL_SALES) %>%
head(10) %>%
ggplot(aes(x = reorder(PACKAGE, TOTAL_SALES), y = TOTAL_SALES)) +
geom_col() +
coord_flip() +
labs(title = "Bottom 10 Sales by Package Size",
x = "Package Size",
y = "$ Sales")
#top 10 sales in thousands by brand
df %>%
group_by(BRAND) %>%
summarise(TOTAL_SALES = sum(DOLLAR_SALES/1000)) %>%
arrange(desc(TOTAL_SALES)) %>%
head(10)
## # A tibble: 10 × 2
## BRAND TOTAL_SALES
## <fct> <dbl>
## 1 BUBBLE JOY ADVANTAGEOUS 1604867.
## 2 REAL-TIME 1037072.
## 3 PEPPY 937030.
## 4 PAPI 816414.
## 5 DIET BUBBLE JOY ADVANTAGEOUS 653091.
## 6 ELF BUBBLES 649144.
## 7 HILL MOISTURE 557565.
## 8 REAL-TIME EDITIONS 514215.
## 9 MYTHICAL BEVERAGE ULTRA 499398.
## 10 MYTHICAL BEVERAGE 462803.
#graph sales in thousands by top 10 brand
df %>%
group_by(BRAND) %>%
summarise(TOTAL_SALES = sum(DOLLAR_SALES/1000)) %>%
arrange(desc(TOTAL_SALES)) %>%
head(10) %>%
ggplot(aes(x = reorder(BRAND, -TOTAL_SALES), y = TOTAL_SALES)) +
geom_col() +
coord_flip() +
labs(title = "Top 10 Sales in Thousands by Brand",
x = "Brand",
y = "$ Sales")
#bottom 10 sales in thousands by brand
df %>%
group_by(BRAND) %>%
summarise(TOTAL_SALES = sum(DOLLAR_SALES)) %>%
arrange(TOTAL_SALES) %>%
head(10)
## # A tibble: 10 × 2
## BRAND TOTAL_SALES
## <fct> <dbl>
## 1 MIST TWST NAT CBRY 0.99
## 2 TIPCAL CANARY TWISTER SODA 1.19
## 3 VAULTLESS 1.69
## 4 DIET MIST TWST MIXED-FRUIT 2
## 5 ONLY ORGANIC 2.29
## 6 DIET BUBBLE JOY ADVANTAGEOUS BLUEBERRY ACAI 2.86
## 7 SUPER-DUPER SPARKLING 4.5
## 8 GORGEOUS ORANGEOUS CHERRY 5.05
## 9 MAXIMUM RED BERRY 5.68
## 10 PAPI CHRY/VAN REAL SUGAR 5.97
#graph sales in thousands by bottom 10 brand
df %>%
group_by(BRAND) %>%
summarise(TOTAL_SALES = sum(DOLLAR_SALES)) %>%
arrange(TOTAL_SALES) %>%
head(10) %>%
ggplot(aes(x = reorder(BRAND, TOTAL_SALES), y = TOTAL_SALES)) +
geom_col() +
coord_flip() +
labs(title = "Bottom 10 Sales by Brand",
x = "Brand",
y = "$ Sales")
#sales in thousands by category
df %>%
group_by(CATEGORY) %>%
summarise(TOTAL_SALES = sum(DOLLAR_SALES/1000)) %>%
arrange(desc(TOTAL_SALES))
## # A tibble: 5 × 2
## CATEGORY TOTAL_SALES
## <fct> <dbl>
## 1 SSD 9606515.
## 2 ENERGY 3718445.
## 3 SPARKLING WATER 626470.
## 4 ING ENHANCED WATER 494293.
## 5 COFFEE 14405.
#graph sales in thousands by CATEGORY
df %>%
group_by(CATEGORY) %>%
summarise(TOTAL_SALES = sum(DOLLAR_SALES/1000)) %>%
ggplot(aes(x = reorder(CATEGORY, -TOTAL_SALES), y = TOTAL_SALES)) +
geom_col() +
coord_flip() +
labs(title = "Total Sales in Thousands by Category",
x = "Category",
y = "$ Sales")
#sales in thousands by season
df %>%
group_by(SEASON) %>%
summarise(TOTAL_SALES = sum(DOLLAR_SALES/1000)) %>%
arrange(desc(TOTAL_SALES))
## # A tibble: 4 × 2
## SEASON TOTAL_SALES
## <fct> <dbl>
## 1 SPRING 3874929.
## 2 SUMMER 3865217.
## 3 WINTER 3616622.
## 4 FALL 3103362.
#graph sales by SEASON
df %>%
group_by(SEASON) %>%
summarise(TOTAL_SALES = sum(DOLLAR_SALES/1000)) %>%
ggplot(aes(x = reorder(SEASON, -TOTAL_SALES), y = TOTAL_SALES)) +
geom_col() +
coord_flip() +
labs(title = "Total Sales in Thousands by Season",
x = "Season",
y = "$ Sales")
COMMENTS: Swire is in 3rd place for overall sales behind Jollys and Cocos for overall sales by manufacturer. The top 3 package sizes range from 12small 12one cup, 20small multi jug, and 16small multi cup. Bottom 3 sales by package size are 8.55small mlt shadyes jug, 20 small 15one jug, and 8.55small 6one shadyes jug. Bottom packges sizes are likely innovation package that did not do so well based on extremely small sales (<$40K). Bubble Joy Advantageous (a Coco’s regular soda) is a clear winner in sales by brand, followed by real-time (Ally’s primarily energy drink in diet/regular), and peppy (Swire-CC’s regular soda). Bottom 10 sales by brand all falls into extremely small buckets of sales less than $6k, are these innovation product failures? The bottom three are Mist Twst Nat Cbry (Jolly’s 1 single sale of regular soda), Tipcal Canary Twister Soda (1 single sale of Jolly’s regular soda), and Vaultless (1 single sale of Coco’s diet soda). The sparkling soda drink category is more than double the next (energy). ING Enhanced Water and Sparkling Water register as notable bottled drinks, but coffee barely scratches the surface in terms of sales. Summer and Spring are roughly about the same in terms of sales, followed by a slight drop in Winter, and a more noticeable drop for Fall.
#top 10 longest running brands
df %>%
group_by(BRAND) %>%
summarise(LENGTH = n_distinct(DATE)) %>%
arrange(desc(LENGTH)) %>%
head(10)
## # A tibble: 10 × 2
## BRAND LENGTH
## <fct> <int>
## 1 CARBONATE STREAM 152
## 2 CUPADA ARID 152
## 3 RADIANT'S 152
## 4 SINGLE GROUP 152
## 5 SPARKLING JACCEPTABLETLESTER 152
## 6 BUBBLE JOY 148
## 7 CARBONATE STREAM WATERS 148
## 8 CROWN 148
## 9 DIGRESS FLAVORED 148
## 10 EXCLAMATION SODA 148
#shortest 10 running brands
df %>%
group_by(BRAND) %>%
summarise(LENGTH = n_distinct(DATE)) %>%
arrange(LENGTH) %>%
head(10)
## # A tibble: 10 × 2
## BRAND LENGTH
## <fct> <int>
## 1 LAUGHING MYTHICAL BEVERAGE CHAI HAI 1
## 2 MIST TWST NAT CBRY 1
## 3 ONLY ORGANIC 1
## 4 PAPI VANILLA REAL SUGAR 1
## 5 TIPCAL CANARY TWISTER SODA 1
## 6 VAULTLESS 1
## 7 DIET BUBBLE JOY ADVANTAGEOUS BLUEBERRY ACAI 2
## 8 DIET BUBBLE JOY ADVANTAGEOUS PLANT-BASED SWEENTENERS 2
## 9 DIET MIST TWST MIXED-FRUIT 2
## 10 DIET MUTANT 2
#median length of time for a brand
df %>%
group_by(BRAND) %>%
summarise(LENGTH = n_distinct(DATE)) %>%
summarise(median(LENGTH))
## # A tibble: 1 × 1
## `median(LENGTH)`
## <int>
## 1 137
#mean length of time for a brand
df %>%
group_by(BRAND) %>%
summarise(LENGTH = n_distinct(DATE)) %>%
summarise(mean(LENGTH))
## # A tibble: 1 × 1
## `mean(LENGTH)`
## <dbl>
## 1 99.1
#density plot of brand run time
df %>%
group_by(BRAND) %>%
summarise(LENGTH = n_distinct(DATE)) %>%
ggplot(aes(x = LENGTH)) +
geom_density() +
labs(title = "Density Plot of Brand Run Time",
x = "Weeks",
y = "Frequency")
#brands that run for less than 6 months
df %>%
group_by(BRAND) %>%
summarise(LENGTH = n_distinct(DATE)) %>%
filter(LENGTH < 26)
## # A tibble: 63 × 2
## BRAND LENGTH
## <fct> <int>
## 1 BARS 4
## 2 BUBBLE JOY ADVANTAGEOUS W/LIME 14
## 3 CLEAR RADIANCE PAPI 24
## 4 CUPADA ARID REMAINING 24
## 5 DIET BUBBLE JOY ADVANTAGEOUS BLUEBERRY ACAI 2
## 6 DIET BUBBLE JOY ADVANTAGEOUS PLANT-BASED SWEENTENERS 2
## 7 DIET BUBBLE JOY ADVANTAGEOUS STRAWBERRY GUAVA 3
## 8 DIET BUBBLE JOY ADVANTAGEOUS W/LIME 21
## 9 DIET DROPTOP 10
## 10 DIET HILL MOISTURE ELECTRICITY 21
## # ℹ 53 more rows
#summarize features of brands that run for less than 6 months
df %>%
group_by(BRAND) %>%
summarise(LENGTH = n_distinct(DATE)) %>%
filter(LENGTH < 6) %>%
left_join(df, by = "BRAND") %>%
select(BRAND, CATEGORY, SEASON, PACKAGE, MANUFACTURER) %>%
distinct()
## # A tibble: 61 × 5
## BRAND CATEGORY SEASON PACKAGE MANUFACTURER
## <fct> <fct> <fct> <fct> <fct>
## 1 BARS SSD FALL 2L MUL… COCOS
## 2 DIET BUBBLE JOY ADVANTAGEOUS BLUEBERRY … SSD SUMMER 12SMAL… COCOS
## 3 DIET BUBBLE JOY ADVANTAGEOUS BLUEBERRY … SSD WINTER 12SMAL… COCOS
## 4 DIET BUBBLE JOY ADVANTAGEOUS PLANT-BASE… SSD SUMMER 7.5SMA… COCOS
## 5 DIET BUBBLE JOY ADVANTAGEOUS STRAWBERRY… SSD SPRING 12SMAL… COCOS
## 6 DIET BUBBLE JOY ADVANTAGEOUS STRAWBERRY… SSD WINTER 12SMAL… COCOS
## 7 DIET BUBBLE JOY ADVANTAGEOUS STRAWBERRY… SSD WINTER 12SMAL… COCOS
## 8 DIET MIST TWST MIXED-FRUIT SSD WINTER 2L MUL… JOLLYS
## 9 DIET MIST TWST MIXED-FRUIT SSD SUMMER 2L MUL… JOLLYS
## 10 DIET MUTANT SSD SPRING 20SMAL… PONYS
## # ℹ 51 more rows
COMMENT: The top 10 brands that run 148 or 152 weeks do not include some of the top sales by brands (are there missing weeks?). The top 10 shortest running brands include bottom brands by sales and only ran for 1 or two weeks, which makes sense if something only registered one single sale. The median length of a brand is 137 weeks, with mean brand run time falling in a 99.1 weeks. Therefore, the data likely exhibits left skewness, indicating that there are some brands with very short run times (which pull down the mean), while the median is higher due to the influence of some brands with longer run times. The histogram of brand run time shows a left skew with several stalwart brands running for a very long time, and a lot of brands, including innovation types, running for a shorter duration. As can be seen in the density plot there are two main humps or modes, one that clusters between 0 and ~25 weeks (6 months) trending down to a flatter line between week 50 and week 100, and another much larger cluster presumably dominated by the always on the shlef types cluster between 125 and 152 weeks. It may be worth looking for week 0/launch/tenure date spikes for those products that runs less than 6 months in order to determine start/stop times for any ARIMA/Time-Series models later.
#top 10 longest running packages
df %>%
group_by(PACKAGE) %>%
summarise(LENGTH = n_distinct(DATE)) %>%
arrange(desc(LENGTH)) %>%
head(10)
## # A tibble: 10 × 2
## PACKAGE LENGTH
## <fct> <int>
## 1 .5L 12ONE JUG 152
## 2 .5L 24ONE JUG 152
## 3 .5L 6ONE JUG 152
## 4 12SMALL 12ONE CUP 152
## 5 12SMALL 15ONE CUP 152
## 6 12SMALL 24ONE CUP 152
## 7 12SMALL 6ONE CUP 152
## 8 16SMALL MULTI CUP 152
## 9 1L MULTI JUG 152
## 10 20SMALL 24ONE JUG 152
#shortest 10 running packages
df %>%
group_by(PACKAGE) %>%
summarise(LENGTH = n_distinct(DATE)) %>%
arrange(LENGTH) %>%
head(10)
## # A tibble: 10 × 2
## PACKAGE LENGTH
## <fct> <int>
## 1 1L 12ONE JUG 1
## 2 20SMALL 15ONE JUG 1
## 3 8.55SMALL 6ONE SHADYES JUG 1
## 4 24SMALL 4ONE JUG 2
## 5 20SMALL 12ONE SHADYES JUG 3
## 6 8.55SMALL MLT SHADYES JUG 3
## 7 .5L 4ONE JUG 5
## 8 12SMALL 32ONE CUP 6
## 9 12SMALL 8ONE JUG 6
## 10 .5L 8ONE SHADYES JUG 7
#median length of time for a package
df %>%
group_by(PACKAGE) %>%
summarise(LENGTH = n_distinct(DATE)) %>%
summarise(median(LENGTH))
## # A tibble: 1 × 1
## `median(LENGTH)`
## <int>
## 1 147
#mean length of time for a package
df %>%
group_by(PACKAGE) %>%
summarise(LENGTH = n_distinct(DATE)) %>%
summarise(mean(LENGTH))
## # A tibble: 1 × 1
## `mean(LENGTH)`
## <dbl>
## 1 117.
#density plot of package run time
df %>%
group_by(PACKAGE) %>%
summarise(LENGTH = n_distinct(DATE)) %>%
ggplot(aes(x = LENGTH)) +
geom_density() +
labs(title = "Density Plot of Package Run Time",
x = "Weeks",
y = "Frequency")
#packages that run for less than 6 months
df %>%
group_by(PACKAGE) %>%
summarise(LENGTH = n_distinct(DATE)) %>%
filter(LENGTH < 26)
## # A tibble: 14 × 2
## PACKAGE LENGTH
## <fct> <int>
## 1 .5L 4ONE JUG 5
## 2 .5L 8ONE SHADYES JUG 7
## 3 12SMALL 12ONE BUMPY CUP 17
## 4 12SMALL 32ONE CUP 6
## 5 12SMALL 8ONE JUG 6
## 6 16SMALL MULTI JUG 12
## 7 1L 12ONE JUG 1
## 8 20SMALL 12ONE SHADYES JUG 3
## 9 20SMALL 15ONE JUG 1
## 10 22SMALL MULTI JUG 15
## 11 24SMALL 4ONE JUG 2
## 12 7.5SMALL 10ONE 12
## 13 8.55SMALL 6ONE SHADYES JUG 1
## 14 8.55SMALL MLT SHADYES JUG 3
#summarize features of packages that run for less than 6 months
df %>%
group_by(PACKAGE) %>%
summarise(LENGTH = n_distinct(DATE)) %>%
filter(LENGTH < 6) %>%
left_join(df, by = "PACKAGE") %>%
select(PACKAGE, CATEGORY, SEASON, BRAND, MANUFACTURER) %>%
distinct()
## # A tibble: 15 × 5
## PACKAGE CATEGORY SEASON BRAND MANUFACTURER
## <fct> <fct> <fct> <fct> <fct>
## 1 .5L 4ONE JUG ING ENHANCED WATER SUMMER VITAMINAL … COCOS
## 2 .5L 4ONE JUG ING ENHANCED WATER WINTER VITAMINAL … COCOS
## 3 .5L 4ONE JUG ING ENHANCED WATER FALL VITAMINAL … COCOS
## 4 .5L 4ONE JUG ING ENHANCED WATER SPRING VITAMINAL … COCOS
## 5 1L 12ONE JUG SPARKLING WATER WINTER INTELLIGEN… COCOS
## 6 20SMALL 12ONE SHADYES JUG SSD WINTER DIET BUBBL… COCOS
## 7 20SMALL 12ONE SHADYES JUG SSD SUMMER DIET BUBBL… COCOS
## 8 20SMALL 12ONE SHADYES JUG SSD SPRING DIET BUBBL… COCOS
## 9 20SMALL 15ONE JUG ING ENHANCED WATER WINTER VITAMINAL … COCOS
## 10 24SMALL 4ONE JUG SSD SPRING HILL MOIST… JOLLYS
## 11 24SMALL 4ONE JUG SSD SUMMER HILL MOIST… JOLLYS
## 12 8.55SMALL 6ONE SHADYES JUG SSD FALL DIET BUBBL… COCOS
## 13 8.55SMALL MLT SHADYES JUG SSD FALL FANTASMIC COCOS
## 14 8.55SMALL MLT SHADYES JUG SSD SPRING FANTASMIC COCOS
## 15 8.55SMALL MLT SHADYES JUG SSD WINTER FANTASMIC COCOS
COMMENT: Top 10 packages run the entire length of the dataset, which is 152 weeks. These are tried and true packaging sizes that we have all likely grown up with, know and love to enjoy our beverages from, whether from a gas station or grocery store. The top 10 shortest packages range from 1 week to 7 weeks. The median package run length is 147, indicating that tried and true packing overwhelmingly dominates distribution sales. The mean package tenure is 117 weeks. Therefore, the data likely exhibits right skewness, indicating that there are some packages with very high tenures (which push up the median), but the mean is lower due to the influence of some packages with shorter tenures. The density plot shows two primary modes, one smaller < 12 weeks, likely innovation dominated, and another larger > 140 weeks, legacy package sizes. There are 14 package sizes that run for less than 6 months or 26 weeks.
#graph DOLLAR_SALES by DATE for BRAND == "DIET SMASH"
df %>%
filter(BRAND == "DIET SMASH") %>%
ggplot(aes(x = DATE, y = DOLLAR_SALES)) +
geom_line() +
labs(title = "Sales for Diet Smash",
x = "Date",
y = "$ Sales")
#graph DOLLAR_SALES by DATE for BRAND == "DIET SMASH" - INNOVATION PACKAGE == "2L MULTI JUG"
df %>%
filter(BRAND == "DIET SMASH", PACKAGE == "2L MULTI JUG") %>%
ggplot(aes(x = DATE, y = DOLLAR_SALES)) +
geom_line() +
labs(title = "Sales for Diet Smash 2L Multi Jug",
x = "Date",
y = "$ Sales")
#graph DOLLAR_SALES by DATE for BRAND == "SPARKLING JACCEPTABLETLESTER",
#CATEGORY == "SSD", CALORIC_SEGMENT == REGULAR
df %>%
filter(BRAND == "SPARKLING JACCEPTABLETLESTER", CATEGORY == "SSD",
CALORIC_SEGMENT == "1" ) %>%
ggplot(aes(x = DATE, y = DOLLAR_SALES)) +
geom_line() +
labs(title = "Sales for Regular Sparkling Jacceptabletlester Soft Drink",
x = "Date",
y = "$ Sales")
#graph DOLLAR_SALES by DATE for BRAND == "VENOMOUS BLAST" and CATEGORY == DIET/LIGHT
df %>%
filter(BRAND == "VENOMOUS BLAST", CALORIC_SEGMENT == 0) %>%
ggplot(aes(x = DATE, y = DOLLAR_SALES)) +
geom_line() +
labs(title = "Sales for Venomous Blast",
x = "Date",
y = "$ Sales")
#graph DOLLAR_SALES by DATE for BRAND == "SQUARE"
df %>%
filter(BRAND == "SQUARE") %>%
ggplot(aes(x = DATE, y = DOLLAR_SALES)) +
geom_line() +
labs(title = "Sales for Square",
x = "Date",
y = "$ Sales")
#graph DOLLAR_SALES by DATE for BRAND == "GREETINGLE"
df %>%
filter(BRAND == "GREETINGLE") %>%
ggplot(aes(x = DATE, y = DOLLAR_SALES)) +
geom_line() +
labs(title = "Sales for Greetingle",
x = "Date",
y = "$ Sales")
#graph DOLLAR_SALES by DATE for BRAND == "DIET MOONLIT"
df %>%
filter(BRAND == "DIET MOONLIT") %>%
ggplot(aes(x = DATE, y = DOLLAR_SALES)) +
geom_line() +
labs(title = "Sales for Diet Moonlit",
x = "Date",
y = "$ Sales")
#graph DOLLAR_SALES by DATE for BRAND == "PEPPY"
df %>%
filter(BRAND == "PEPPY") %>%
ggplot(aes(x = DATE, y = DOLLAR_SALES)) +
geom_line() +
labs(title = "Sales for Peppy",
x = "Date",
y = "$ Sales")
COMMENT: Some seasonlity is observed in all drinks analyzed, with most including missing weeks and spikes. Are missing weeks, missing data, end of product lifecycle, or something else? The missing period seems common to several products analyzed.
#What are the common gaps between DATE where there is no weekly data for BRAND == "DIET SMASH"
df %>%
filter(BRAND == "DIET SMASH") %>%
arrange(DATE) %>%
mutate(DIFF = DATE - lag(DATE)) %>%
filter(DIFF > 7) %>%
count(DIFF) %>%
arrange(desc(n)) %>%
head(20)
## DIFF n
## 1 42 days 1
#What are the common gaps between DATE where there is no weekly data for BRAND == "DIET SMASH" and PACKAGE == "2L MULTI JUG"
df %>%
filter(BRAND == "DIET SMASH", PACKAGE == "2L MULTI JUG") %>%
arrange(DATE) %>%
mutate(DIFF = DATE - lag(DATE)) %>%
filter(DIFF > 7) %>%
count(DIFF) %>%
arrange(desc(n)) %>%
head(20)
## DIFF n
## 1 35 days 1
#What are the common gaps between DATE where there is no weekly data for BRAND == "SPARKLING JACCEPTABLETLESTER",
#CATEGORY == "SSD", CALORIC_SEGMENT == REGULAR
df %>%
filter(BRAND == "SPARKLING JACCEPTABLETLESTER", CATEGORY == "SSD",
CALORIC_SEGMENT == 1) %>%
arrange(DATE) %>%
mutate(DIFF = DATE - lag(DATE)) %>%
filter(DIFF > 7) %>%
count(DIFF) %>%
arrange(desc(n)) %>%
head(20)
## DIFF n
## 1 42 days 1
#What are the common gaps between DATE where there is no weekly data for BRAND == "VENOMOUS BLAST" and CALORIC_SEGMENT == DIET/LIGHT
df %>%
filter(BRAND == "VENOMOUS BLAST", CALORIC_SEGMENT == 0 ) %>%
arrange(DATE) %>%
mutate(DIFF = DATE - lag(DATE)) %>%
filter(DIFF > 7) %>%
count(DIFF) %>%
arrange(desc(n)) %>%
head(20)
## DIFF n
## 1 35 days 1
## 2 70 days 1
#What are the common gaps between DATE where there is no weekly data for BRAND == "SQUARE"
df %>%
filter(BRAND == "SQUARE") %>%
arrange(DATE) %>%
mutate(DIFF = DATE - lag(DATE)) %>%
filter(DIFF > 7) %>%
count(DIFF) %>%
arrange(desc(n)) %>%
head(20)
## DIFF n
## 1 35 days 3
## 2 14 days 2
## 3 28 days 2
## 4 21 days 1
## 5 70 days 1
## 6 77 days 1
## 7 126 days 1
#What are the common gaps between DATE where there is no weekly data for BRAND == "GREETINGLE"
df %>%
filter(BRAND == "GREETINGLE") %>%
arrange(DATE) %>%
mutate(DIFF = DATE - lag(DATE)) %>%
filter(DIFF > 7) %>%
count(DIFF) %>%
arrange(desc(n)) %>%
head(20)
## DIFF n
## 1 35 days 1
#What are the common gaps between DATE where there is no weekly data for BRAND == "DIET MOONLIT"
df %>%
filter(BRAND == "DIET MOONLIT") %>%
arrange(DATE) %>%
mutate(DIFF = DATE - lag(DATE)) %>%
filter(DIFF > 7) %>%
count(DIFF) %>%
arrange(desc(n)) %>%
head(20)
## DIFF n
## 1 42 days 1
#What are the common gaps between DATE where there is no weekly data for BRAND == "PEPPY"
df %>%
filter(BRAND == "PEPPY") %>%
arrange(DATE) %>%
mutate(DIFF = DATE - lag(DATE)) %>%
filter(DIFF > 7) %>%
count(DIFF) %>%
arrange(desc(n)) %>%
head(20)
## DIFF n
## 1 42 days 1
COMMENT: The “Diet Smash” brand has a gap in dates of 42 days. If packaging such as “2L Multi Jug” is added in, Diet Smash has a gap of 35 days. The “regular Sparkling Jacceptabletler brand in the ssd category” has a gap of 42 days. The “Venomous Blast” brand has 2 gaps, one of 35 days and the other of 70 days. The “Square” brand has 7 gap lengths in dates, with the most common of 35 days occuring 3 times, and the longest being 126 days. The “Greetingle” brand has a single gap length of 35 days. The “Diet Moonlit” has a single gap of 42 days. The “Peppy” brand has a single gap of 42 days. Further analysis should be done to ensure start/stop dates for tenure are accurate and that we are not missing weekly data. The question is whether or not this constitutes missing data OR product taken off market and put back into the market based on supply, supply chain, or other business issues.
#most common launch DATE, end DATE, and TENURE of brand or package less than 6 months, or 26 weeks, in duration
df %>%
group_by(BRAND, PACKAGE) %>%
summarise(MIN_DATE = min(DATE),
MAX_DATE = max(DATE),
TENURE = MAX_DATE - MIN_DATE) %>%
filter(TENURE < 26) %>%
arrange(desc(TENURE))
## # A tibble: 145 × 5
## # Groups: BRAND [102]
## BRAND PACKAGE MIN_DATE MAX_DATE TENURE
## <fct> <fct> <date> <date> <drtn>
## 1 BARS 2L MUL… 2022-10-29 2022-11-19 21 da…
## 2 BUBBLE JOY ADVANTAGEOUS MOON 7.5SMA… 2022-02-26 2022-03-19 21 da…
## 3 JUICY SQUIRREL 3L MUL… 2023-10-07 2023-10-28 21 da…
## 4 KOOL! READY-TO-GO 20SMAL… 2023-10-07 2023-10-28 21 da…
## 5 KOOL! READY-TO-GO 7.5SMA… 2023-10-07 2023-10-28 21 da…
## 6 KOOL! ZERO SUGAR READY-TO-GO 7.5SMA… 2023-10-07 2023-10-28 21 da…
## 7 ORANGE VANILLA BUBBLE JOY ADVANTAGEOUS … 12SMAL… 2021-06-12 2021-07-03 21 da…
## 8 PAPI ZERO SUGAR 24SMAL… 2020-12-19 2021-01-09 21 da…
## 9 YAWN TROP 20SMAL… 2023-10-07 2023-10-28 21 da…
## 10 DIET BUBBLE JOY ADVANTAGEOUS PLANT-BASE… 7.5SMA… 2023-08-05 2023-08-19 14 da…
## # ℹ 135 more rows
#summarize top 10 brand or package less than 6 months, or 26 weeks, in duration for innovation set
df %>%
group_by(BRAND, PACKAGE) %>%
summarise(MIN_DATE = min(DATE),
MAX_DATE = max(DATE),
TENURE = MAX_DATE - MIN_DATE) %>%
filter(TENURE < 26) %>%
arrange(desc(TENURE)) %>%
head(10)
## # A tibble: 10 × 5
## # Groups: BRAND [9]
## BRAND PACKAGE MIN_DATE MAX_DATE TENURE
## <fct> <fct> <date> <date> <drtn>
## 1 BARS 2L MUL… 2022-10-29 2022-11-19 21 da…
## 2 BUBBLE JOY ADVANTAGEOUS MOON 7.5SMA… 2022-02-26 2022-03-19 21 da…
## 3 JUICY SQUIRREL 3L MUL… 2023-10-07 2023-10-28 21 da…
## 4 KOOL! READY-TO-GO 20SMAL… 2023-10-07 2023-10-28 21 da…
## 5 KOOL! READY-TO-GO 7.5SMA… 2023-10-07 2023-10-28 21 da…
## 6 KOOL! ZERO SUGAR READY-TO-GO 7.5SMA… 2023-10-07 2023-10-28 21 da…
## 7 ORANGE VANILLA BUBBLE JOY ADVANTAGEOUS … 12SMAL… 2021-06-12 2021-07-03 21 da…
## 8 PAPI ZERO SUGAR 24SMAL… 2020-12-19 2021-01-09 21 da…
## 9 YAWN TROP 20SMAL… 2023-10-07 2023-10-28 21 da…
## 10 DIET BUBBLE JOY ADVANTAGEOUS PLANT-BASE… 7.5SMA… 2023-08-05 2023-08-19 14 da…
#density plot of TENURE less than 6 months, or 26 weeks.
df %>%
group_by(BRAND, PACKAGE) %>%
summarise(MIN_DATE = min(DATE),
MAX_DATE = max(DATE),
TENURE = MAX_DATE - MIN_DATE) %>%
filter(TENURE < 26) %>%
ggplot(aes(x = TENURE)) +
geom_density() +
labs(title = "Density of Tenure less than 6 months",
x = "Tenure",
y = "Density")
145 Brand/Package sets under 6 months, with start and stop date to set as week 0. Will need to guarantee that each one is not missing week/date data for time series analysis. 9 of the top 10 brand/package sets from this set run 21 days, with the 10th running 14 days. October 7th is an extremely popular launch date. The most common tenure of the 145 brand/package combos with tenure less than 6 months run for less than 5 weeks in duration.
### Create table counting each item and how many brand
categories_count <- df %>%
group_by(ITEM) %>%
summarize(
num_manufacturers = n_distinct(MANUFACTURER),
num_category = n_distinct(CATEGORY),
num_market_key = n_distinct(MARKET_KEY),
num_caloric_segment = n_distinct(CALORIC_SEGMENT),
num_brand = n_distinct(BRAND),
num_package = n_distinct(PACKAGE)
)
summary(categories_count)
## ITEM num_manufacturers num_category num_market_key
## Length:3692 Min. :1 Min. :1 Min. : 1.00
## Class :character 1st Qu.:1 1st Qu.:1 1st Qu.: 3.00
## Mode :character Median :1 Median :1 Median : 40.00
## Mean :1 Mean :1 Mean : 82.14
## 3rd Qu.:1 3rd Qu.:1 3rd Qu.:189.00
## Max. :1 Max. :1 Max. :200.00
## num_caloric_segment num_brand num_package
## Min. :1.000 Min. :1.000 Min. :1.00
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.00
## Median :1.000 Median :1.000 Median :1.00
## Mean :1.013 Mean :1.026 Mean :1.01
## 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:1.00
## Max. :2.000 Max. :3.000 Max. :3.00
categories_count %>%
summarise(
count_more_than_one_caloric_segment = sum(num_caloric_segment > 1),
count_more_than_one_brand = sum(num_brand > 1),
count_more_than_one_package = sum(num_package > 1)
)
## # A tibble: 1 × 3
## count_more_than_one_caloric_se…¹ count_more_than_one_…² count_more_than_one_…³
## <int> <int> <int>
## 1 49 88 34
## # ℹ abbreviated names: ¹count_more_than_one_caloric_segment,
## # ²count_more_than_one_brand, ³count_more_than_one_package
#88 items fall into 2 or more brands
#49 items fall into 2 or more categories
#34 Items with 2 or more packages
df %>%
inner_join(categories_count %>%
group_by(ITEM) %>%
filter(sum(num_brand) > 1) %>%
select(ITEM),
by = "ITEM") %>%
select(ITEM, BRAND) %>%
arrange(ITEM) %>%
distinct(ITEM, BRAND) %>%
head(5)
## ITEM
## 1 AZURE HORIZON GENTLE DRINK SUPER-JUICE DURIAN CUP 12 LIQUID SMALL
## 2 AZURE HORIZON GENTLE DRINK SUPER-JUICE DURIAN CUP 12 LIQUID SMALL
## 3 CUPADA ARID GENTLE DRINK GINGER ALE AND ADE CUP 12 LIQUID SMALL X12
## 4 CUPADA ARID GENTLE DRINK GINGER ALE AND ADE CUP 12 LIQUID SMALL X12
## 5 CUPADA ARID GENTLE DRINK GINGER ALE CUP 12 LIQUID SMALL
## BRAND
## 1 AZURE HORIZON
## 2 DIET AZURE HORIZON
## 3 CUPADA ARID
## 4 DIET CUPADA ARID
## 5 CUPADA ARID
# This DF shows that many of the items with 2 types of brand are both in Diet and Regular category. If we are using these items as filters in our model building we will need to remember this.
df %>%
inner_join(categories_count %>%
group_by(ITEM) %>%
filter(sum(num_caloric_segment) > 1) %>%
select(ITEM),
by = "ITEM") %>%
select(ITEM, CALORIC_SEGMENT) %>%
arrange(ITEM) %>%
distinct(ITEM, CALORIC_SEGMENT) %>%
head(5)
## ITEM
## 1 AAE LIQUORICE REVITALIZING BOOST LIQUID FREEZE BURN FAT LS ENRGY TCHNL JUG 8 LIQUID SMALL
## 2 AAE LIQUORICE REVITALIZING BOOST LIQUID FREEZE BURN FAT LS ENRGY TCHNL JUG 8 LIQUID SMALL
## 3 AZURE HORIZON GENTLE DRINK SUPER-JUICE DURIAN CUP 12 LIQUID SMALL
## 4 AZURE HORIZON GENTLE DRINK SUPER-JUICE DURIAN CUP 12 LIQUID SMALL
## 5 CARBONATE STREAM ENERGY DRINK CONCENTRATE UNFLAVORED JUG 14.8 LIQUID SMALL
## CALORIC_SEGMENT
## 1 1
## 2 0
## 3 1
## 4 0
## 5 1
# As with the brand caloric segment has the same duplicate items in both segments. We will need to remember this in modeling
df %>%
inner_join(categories_count %>%
group_by(ITEM) %>%
filter(sum(num_package) > 1) %>%
select(ITEM),
by = "ITEM") %>%
select(ITEM, PACKAGE) %>%
arrange(ITEM) %>%
distinct(ITEM, PACKAGE) %>%
head(5)
## ITEM
## 1 BUBBLE JOY WATER-JUGD-CARBONATED CONTAINER 288 LIQUID SMALL
## 2 BUBBLE JOY WATER-JUGD-CARBONATED CONTAINER 288 LIQUID SMALL
## 3 CUPADA ARID GENTLE DRINK GINGER ALE JUG 10 LIQUID SMALL X6
## 4 CUPADA ARID GENTLE DRINK GINGER ALE JUG 10 LIQUID SMALL X6
## 5 CUPADA ARID TONIC WATER UNFLAVORED JUG 10 LIQUID SMALL X6
## PACKAGE
## 1 12SMALL 24ONE CUP
## 2 ALL OTHER ONES
## 3 ALL OTHER ONES
## 4 10SMALL 6ONE PLASTICS JUG
## 5 ALL OTHER ONES
#Even though these items have a package in the item description we their packaging category changes.
# We see that there are products that fall into multiple Caloric Segments, Brands and Packages. We will need to keep these items in mind when building models. When creating our smaller data sets for modeling we will want to assure that we filter by these 3 categories.
#Create table to summarize total sale days
sales_summary <- df %>%
group_by(ITEM) %>%
summarize(first_date = min(DATE), last_date = max(DATE), total_sales = sum(UNIT_SALES), total_revenue = sum(DOLLAR_SALES), total_sale_days = n_distinct(DATE))
#Calculate Total window of days sold
sales_summary <- sales_summary %>%
mutate(
duration_days = last_date - first_date,
duration_weeks = ceiling(as.numeric(duration_days) / 7),
launch13week_date =first_date + lubridate::weeks(13),
launch6month_date = first_date + months(6),
launch1year_date = first_date + lubridate::years(1),
avg_sales_per_week = ifelse(duration_weeks == 0, total_sales, total_sales / duration_weeks),
)
#Batch data in to categories 1 day sales, upto 13 week sales, 13 week to 6 months, 6 months to a year, more than a year = ongoing
sales_summary$sales_category <- cut(sales_summary$duration_weeks, breaks = c(-Inf, 0, 13, 26, 52, Inf),
labels = c("One Day Sales", "13 Week Sales", "6 Month Sales", "1 Year Sales", "Ongoing"))
# In the section we added sales summaries to our items. We see that when breaking up our items into sales groups we have a majority of items that are ongoing. These will most likely not be helpful when running our 7 questions based around limited sales.
ggplot(sales_summary, aes(x = sales_category)) +
geom_bar() +
geom_text(aes(label = after_stat(count)), stat = "count", vjust = 1.5, colour = "white")+
labs(title = "Count of Items in Sales Category",
x = "Category",
y = "Count")
#Create plot of sales category with average weekly sales
sales_summary %>%
filter(sales_category != "Ongoing") %>%
ggplot( aes(x = avg_sales_per_week, y = sales_category, color = sales_category)) +
geom_point() +
labs(title = "Average Weekly Sales by Sales Category",
x = "Average Weekly Sales",
y = "Sales Category",
color = "Sales Category")
# This plot shows us that in our short sale items we are highly weighted to the left with a few large outliers when creating our modeling sets we will want to do more outlier analysis on each group to see how to address each one.
df <- left_join(df,sales_summary %>% select(ITEM, sales_category, duration_days, duration_weeks, total_sale_days, first_date, launch13week_date, launch6month_date, launch1year_date), by = "ITEM")
# Calculate days since launch
df <- df %>%
mutate(days_since_launch = as.numeric(DATE - first_date),
weeks_since_launch = ceiling(as.numeric((DATE - first_date)/7)))
# Group by sales category and create separate line graphs for each sales category
df %>%
group_by(sales_category, weeks_since_launch) %>%
summarize(total_unit_sales = sum(UNIT_SALES)) %>%
group_by(sales_category) %>%
filter(total_unit_sales > 0) %>%
ggplot( aes(x = weeks_since_launch, y = total_unit_sales)) +
geom_line() +
labs(title = "Total Unit Sales by Weeks Since Launch",
x = "Weeks Since Launch",
y = "Total Unit Sales") +
facet_wrap(~ sales_category, scales = "free")
COMMENT: In these 5 line graphs we are given the shape of sales from launch date. We see that with the short term products a large spike starts and tapers off as time goes on. “Ongoing” products tend to start strong and then slowly start to fall over time. This demenstrates further that when modeling breaking our items into categories to model will help us be more accurate when looking at forecasting newe products for shorter amounts of time. It also reinforces that we should be able to exclude our ongoing products from the data.
# These graphs show us when forecasting for specific weeks of the year we will have large amounts of variance based on past sales.
# We may need to look at outliers in the 13 week and 1 year category.
# Group by sales category and create separate line graphs
df %>%
mutate(week_of_year = week(DATE)) %>%
group_by(sales_category, week_of_year) %>%
summarize(total_unit_sales = sum(UNIT_SALES)) %>%
group_by(sales_category) %>%
filter(total_unit_sales > 0) %>%
ggplot(aes(x = week_of_year, y = total_unit_sales)) +
geom_line() +
labs(title = "Total Unit Sales by Week of the Year",
x = "Week of the Year",
y = "Total Unit Sales") +
facet_wrap(~ sales_category, scales = "free_y")
Item Description: Diet Smash Plum 11Small 4One Caloric Segment: Diet Market Category: SSD Manufacturer: Swire-CC Brand: Diet Smash Package Type: 11Small 4One Flavor: ‘Plum’ Which 13 weeks of the year would this product perform best in the market? What is the forecasted demand, in weeks, for those 13 weeks?
library(stringr)
# Matching parameters for Q1
df %>%
filter(sales_category == "13 Week Sales",
CALORIC_SEGMENT == 0,
CATEGORY == "SSD") %>%
#str_detect(ITEM, "PLUM")) %>%
group_by(ITEM) %>%
summarize(distinct_items = n_distinct(ITEM))
## # A tibble: 37 × 2
## ITEM distinct_items
## <chr> <int>
## 1 AZURE HORIZON FREE GENTLE DRINK SUPER-JUICE DURIAN CALORIE … 1
## 2 CAFFEINE FREE DIET PAPI GENTLE DRINK COLA DIET JUG 16 LIQUID … 1
## 3 CAFFEINE FREE DIET RAINING GENTLE DRINK AVOCADO DIET CUP 12 … 1
## 4 CUPSHIELD'S GENTLE DRINK POWDER FUDYNAMOE DIET CUP 12 LIQUID… 1
## 5 CUPSHIELD'S TONIC WATER UNFLAVORED DIET JUG 33.8 LIQUID SMALL 1
## 6 DIET BUBBLE JOY ADVANTAGEOUS GENTLE DRINK COLA DIET JUG 12 LI… 1
## 7 DIET BUBBLE JOY ADVANTAGEOUS GENTLE DRINK COLA PINK GUAVA DI… 1
## 8 DIET BUBBLE JOY ADVANTAGEOUS GENTLE DRINK COLA PINK GUAVA DI… 1
## 9 DIET BUBBLE JOY ADVANTAGEOUS GENTLE DRINK DURIAN COLA DIET C… 1
## 10 DIET GORGEOUS SUNSET OUS GENTLE DRINK AVOCADO DIET CUP 12 LI… 1
## # ℹ 27 more rows
# There are 37 items that match time period caloric segment, category. flavor and packaging dont exist
df %>%
filter(str_detect(ITEM, "PLUM")) %>%
summarize(distinct_items = n_distinct(ITEM))
## distinct_items
## 1 64
#64 Plum Flavored items
#Distribution of matching items
df %>%
filter(sales_category == "13 Week Sales",
CALORIC_SEGMENT == 0,
CATEGORY == "SSD") %>%
group_by(ITEM, weeks_since_launch) %>%
summarize(total_unit_sales = sum(UNIT_SALES)) %>%
ggplot(aes(x = weeks_since_launch, y = total_unit_sales)) +
geom_bar(stat = "identity") +
labs(title = "Total Unit Sales by Week Since Launch",
x = "Week Since Launch",
y = "Total Unit Sales")
# Create Sales Category distribution of plum items
df %>%
filter(str_detect(ITEM, "PLUM")) %>%
group_by(sales_category) %>%
summarize(distinct_items = n_distinct(ITEM)) %>%
ggplot(aes(x = sales_category, y = distinct_items)) +
geom_bar(stat = "identity") +
geom_text(aes(label = distinct_items), vjust = -.5, color = "black") +
labs(title = "Count Distinct Items by Sales Category With Plum Flavor",
x = "Sales Category",
y = "Count of Distinct Items")
In this we found that there is a potential 37 itmes that have matching paramaters minus the flavor and packageing size. There are no itmes with the package size and 64 total plum flavored items.
Item Description: Sparkling Jacceptabletlester Avocado 11Small MLT Caloric Segment: Regular Market Category: SSD Manufacturer: Swire-CC Brand: Sparkling Jacceptabletlester SPARKLING JACCEPTABLETLESTER Package Type: 11Small MLT Flavor: ‘Avocado’ Swire plans to release this product 2 weeks prior to Easter and 2 weeks post Easter. What will the forecasted demand be, in weeks, for this product?
# Matching parameters for Q2
df %>%
filter(
month(first_date) %in% c(3, 4),
CALORIC_SEGMENT == 1,
CATEGORY == "SSD") %>%
#str_detect(ITEM, "AVOCADO") %>%
summarize(distinct_items = n_distinct(ITEM))
## distinct_items
## 1 148
#148 items match launching in either March or April, regular and SSD category. There are non with Avocado in this group
df %>%
filter(str_detect(ITEM, "AVOCADO")) %>%
summarize(distinct_items = n_distinct(ITEM))
## distinct_items
## 1 340
#340 AVOCADO Flavored items
#Distribution of matching items
df %>%
filter(
month(first_date) %in% c(3, 4),
CALORIC_SEGMENT == 1,
CATEGORY == "SSD",
sales_category != 'Ongoing') %>%
group_by(ITEM, days_since_launch) %>%
summarize(total_unit_sales = sum(UNIT_SALES)) %>%
ggplot(aes(x = days_since_launch, y = total_unit_sales)) +
geom_bar(stat = "identity") +
labs(title = "Total Unit Sales from Launch Date of Q2 Matching Products",
x = "Days Since Launch",
y = "Total Unit Sales")
# Create Sales Category distribution of AVACADO items
df %>%
filter(str_detect(ITEM, "AVOCADO")) %>%
group_by(sales_category) %>%
summarize(distinct_items = n_distinct(ITEM)) %>%
ggplot(aes(x = sales_category, y = distinct_items)) +
geom_bar(stat = "identity") +
geom_text(aes(label = distinct_items), vjust = -.5, color = "black") +
labs(title = "Distinct Items by Sales Category with AVOCADO Flavor",
x = "Sales Category",
y = "Count of Distinct Items")
From this section we are able to find a group of 148 items that launched either in March or April that have some of the matching features of product 2. We also have a much larger group or products that have avocado with 340 itmes.
Item Description: Diet Venomous Blast Energy Drink Kiwano 16 Liquid Small Caloric Segment: Diet Market Category: Energy Manufacturer: Swire-CC Brand: Venomous Blast Package Type: 16 Liquid Small Flavor: ’Kiwano’ Which 13 weeks of the year would this product perform best in the market? What is the forecasted demand, in weeks, for those 13 weeks?
# Matching parameters for Q3
df %>%
filter(
sales_category == "13 Week Sales",
CALORIC_SEGMENT == 0,
CATEGORY == "ENERGY",
#BRAND == "VENOMOUS BLAST",
#str_detect(ITEM, "KIWANO")
) %>%
group_by(ITEM) %>%
summarize(distinct_items = n_distinct(ITEM))
## # A tibble: 16 × 2
## ITEM distinct_items
## <chr> <int>
## 1 AAE LIQUORICE REVITALIZING BOOST LIQUID FREEZE AND BURN JUG 8… 1
## 2 AUTHENTIC SIP SPARKLES WATER KIWANO MIST ZERO CALORIES CUP 1… 1
## 3 AUTHENTIC SIP SPARKLES WATER MANDARIN YUZU ZERO CALORIES CUP … 1
## 4 AUTHENTIC SIP SPARKLES WATER WHITE POPPIN KEEN ZERO CALORIE… 1
## 5 KEKE ENERGY ENERGY REVITALIZING BOOST LIQUID JUG 1.93 LIQUID … 1
## 6 KEKE ENERGY ENERGY REVITALIZING BOOST LIQUID NUTRIENTS JUG 1.… 1
## 7 MYTHICAL BEVERAGE LO-CARB ENERGY DRINK UNFLAVORED CUP 8.3 LIQ… 1
## 8 MYTHICAL BEVERAGE MAXX ENERGY DRINK RAD RED ZERO SUGAR CUP 12… 1
## 9 MYTHICAL BEVERAGE REHAB DRINK FLAVORED ENERGY DRINK DRINK A… 1
## 10 MYTHICAL BEVERAGE REHAB ENERGY DRINK KIWANO CUP 15.5 LIQUID … 1
## 11 MYTHICAL BEVERAGE ULTRA SUNRISE ENERGY DRINK ULTRA SUNRISE ZE… 1
## 12 MYTHICAL BEVERAGE ZERO ULTRA ENERGY DRINK UNFLAVORED CUP 12 L… 1
## 13 POW-POW DIETARY HEALTH SUPPLEMENT LIQUID POTENT BRAIN AND BOD… 1
## 14 REAL-TIME THE KEEN EDITION ENERGY DRINK CRISP KEEN SUGAR FR… 1
## 15 RULE TEMPEST REVITALIZING BOOST LIQUID CLEAN ENERGY CUP 12 CO… 1
## 16 RULE TEMPEST REVITALIZING BOOST LIQUID CLEAN ENERGY CUP 12 LI… 1
# There are only 16 items that match time period caloric segment, category. flavor, packaging and brand dont exist
df %>%
filter(str_detect(ITEM, "KIWANO")) %>%
summarize(distinct_items = n_distinct(ITEM))
## distinct_items
## 1 76
#76 Kiwano Flavored items
#Distribution of matching items
df %>%
filter(sales_category == "13 Week Sales",
CALORIC_SEGMENT == 0,
CATEGORY == "ENERGY") %>%
group_by(ITEM, weeks_since_launch) %>%
summarize(total_unit_sales = sum(UNIT_SALES)) %>%
ggplot(aes(x = weeks_since_launch, y = total_unit_sales)) +
geom_bar(stat = "identity") +
labs(title = "Total Unit Sales by Week Since Launch",
x = "Week Since Launch",
y = "Total Unit Sales")
# Create Sales Category distribution of Kiwano items
df %>%
filter(str_detect(ITEM, "KIWANO")) %>%
group_by(sales_category) %>%
summarize(distinct_items = n_distinct(ITEM)) %>%
ggplot(aes(x = sales_category, y = distinct_items)) +
geom_bar(stat = "identity") +
geom_text(aes(label = distinct_items), vjust = -.5, color = "black") +
labs(title = "Count Distinct Items by Sales Category With Kiwano Flavor",
x = "Sales Category",
y = "Count of Distinct Items")
In this section we find that there is signficatnly less data around energy drinks that match our category. Overall there is 76 current historical KIWANO flavored items with only 3 of those being sold for more than one day but less than 13 weeks.
Item Description: Diet Square Mulberries Sparkling Water 10Small MLT Caloric Segment: Diet Market Category: Sparkling Water Manufacturer: Swire-CC Brand: Square Package Type: 10Small MLT Flavor: Mulberries Swire plans to release this product for the duration of 1 year but only in the Northern region. What will the forecasted demand be, in weeks, for this product?
# Matching parameters for Q4
df %>%
filter(
sales_category == "1 Year Sales",
CALORIC_SEGMENT == 0,
CATEGORY == "SPARKLING WATER") %>%
#str_detect(ITEM, "MULBERRIES")) %>%
summarize(distinct_items = n_distinct(ITEM))
## distinct_items
## 1 34
#34 items match 1 year launch sales, diet and Sparkling Water category. There are non with Mulberries in this group
df %>%
filter(str_detect(ITEM, "MULBERRIES")) %>%
summarize(distinct_items = n_distinct(ITEM))
## distinct_items
## 1 26
#26 Mulberries Flavored items
#Distribution of matching items
df %>%
filter(
sales_category == "1 Year Sales",
CALORIC_SEGMENT == 0,
CATEGORY == "SPARKLING WATER",
sales_category != 'Ongoing') %>%
group_by(ITEM, days_since_launch) %>%
summarize(total_unit_sales = sum(UNIT_SALES)) %>%
ggplot(aes(x = days_since_launch, y = total_unit_sales)) +
geom_bar(stat = "identity") +
labs(title = "Total Unit Sales from Launch Date of Q4 Matching Products",
x = "Days Since Launch",
y = "Total Unit Sales")
# Create Sales Category distribution of Mulberries items
df %>%
filter(str_detect(ITEM, "MULBERRIES")) %>%
group_by(sales_category) %>%
summarize(distinct_items = n_distinct(ITEM)) %>%
ggplot(aes(x = sales_category, y = distinct_items)) +
geom_bar(stat = "identity") +
geom_text(aes(label = distinct_items), vjust = -.5, color = "black") +
labs(title = "Distinct Items by Sales Category with Mulberry Flavor",
x = "Sales Category",
y = "Count of Distinct Items")
In this section as with the energy drinks, we find there is much less data around sparkling water with only 34 items matching category, segment and sales category. We will need to add in demographic data for just sales in the northern region when setting up modeling. Also, of interest there currently has been no Mulberry products placed on the market for only 1 year.
Item Description: Greetingle Health Beverage Woodsy Yellow .5L 12One Jug Caloric Segment: Regular Market Category: ING Enhanced Water Manufacturer: Swire-CC Brand: Greetingle Package Type: .5L 12One Jug Flavor: ‘Woodsy Yellow’ Swire plans to release this product for 13 weeks, but only in one region. Which region would it perform best in?
# Matching parameters for Q5
df %>%
filter(
#sales_category == "13 Week Sales",
CALORIC_SEGMENT == 1,
CATEGORY == "ING ENHANCED WATER",
#BRAND == "GREETINGLE",
#str_detect(ITEM, "WOODSY YELLOW")
) %>%
group_by(ITEM) %>%
summarize(distinct_items = n_distinct(ITEM))
## # A tibble: 55 × 2
## ITEM distinct_items
## <chr> <int>
## 1 MYTHICAL BEVERAGE HYDRO ENERGY DRINK BLUE ICE JUG 25.4 LIQUID… 1
## 2 MYTHICAL BEVERAGE HYDRO ENERGY DRINK MANIC CANES CUP 16.9 LI… 1
## 3 MYTHICAL BEVERAGE HYDRO ENERGY DRINK MANIC CANES JUG 25.4 LI… 1
## 4 MYTHICAL BEVERAGE HYDRO ENERGY DRINK MEAN CUSTARD APPLE CUP … 1
## 5 MYTHICAL BEVERAGE HYDRO ENERGY DRINK MEAN CUSTARD APPLE JUG … 1
## 6 MYTHICAL BEVERAGE HYDRO ENERGY DRINK PURPLE EXCITEMENT JUG 2… 1
## 7 MYTHICAL BEVERAGE HYDRO ENERGY DRINK WOODSY THUNDER CUP 16.9… 1
## 8 MYTHICAL BEVERAGE HYDRO ENERGY DRINK WOODSY THUNDER JUG 25.4… 1
## 9 MYTHICAL BEVERAGE HYDRO ENERGY WATER BLUE ICE JUG 20 LIQUID S… 1
## 10 MYTHICAL BEVERAGE HYDRO ENERGY WATER BLUE ICE JUG 20 LIQUID S… 1
## # ℹ 45 more rows
# There are only 55 items that match Caloric Segment and Category. This one the sales cateogry also does not have matches.
df %>%
filter(str_detect(ITEM, "WOODSY YELLOW")) %>%
summarize(distinct_items = n_distinct(ITEM))
## distinct_items
## 1 0
#0 Flavored items
#Distribution of matching items
df %>%
filter(#sales_category == "13 Week Sales",
CALORIC_SEGMENT == 1,
CATEGORY == "ING ENHANCED WATER") %>%
group_by(ITEM, weeks_since_launch) %>%
summarize(total_unit_sales = sum(UNIT_SALES)) %>%
ggplot(aes(x = weeks_since_launch, y = total_unit_sales)) +
geom_bar(stat = "identity") +
labs(title = "Total Unit Sales by Week Since Launch",
x = "Week Since Launch",
y = "Total Unit Sales")
This product proved to be the least common to our current data. The enhanced water segement is very small and has not had any limited releases like this before. Also there has been no sales of this flavor in the past. This will be one where our estimate will have many larget assumptions especially once we bring in the Demographic of only selling in one region.
Item Description: Diet Energy Moonlit Casava 2L Multi Jug Caloric Segment: Diet Market Category: Energy Manufacturer: Swire-CC Brand: Diet Moonlit Package Type: 2L Multi Jug Flavor: ‘Cassava’ Swire plans to release this product for 6 months. What will the forecasted demand be, in weeks, for this product?
# Matching parameters for Q6
df %>%
filter(
sales_category == "6 Month Sales",
CALORIC_SEGMENT == 0,
CATEGORY == "ENERGY"
#str_detect(ITEM, "CASSAVA")
) %>%
summarize(distinct_items = n_distinct(ITEM))
## distinct_items
## 1 10
#10 items match 6 Month launch sales, diet and Energy category. There are non with Cassava in this group
df %>%
filter(str_detect(ITEM, "CASSAVA")) %>%
summarize(distinct_items = n_distinct(ITEM))
## distinct_items
## 1 0
# 0 Cassava Flavored items
#Distribution of matching items
df %>%
filter(
sales_category == "6 Month Sales",
CALORIC_SEGMENT == 0,
CATEGORY == "ENERGY",
sales_category != 'Ongoing') %>%
group_by(ITEM, days_since_launch) %>%
summarize(total_unit_sales = sum(UNIT_SALES)) %>%
ggplot(aes(x = days_since_launch, y = total_unit_sales)) +
geom_bar(stat = "identity") +
labs(title = "Total Unit Sales from Launch Date of Q6 Matching Products",
x = "Days Since Launch",
y = "Total Unit Sales")
# Distribution of Sales by month of the year
df %>%
filter(
sales_category == "6 Month Sales",
CALORIC_SEGMENT == 0,
CATEGORY == "ENERGY",
sales_category != 'Ongoing') %>%
group_by(ITEM, MONTH) %>%
summarize(total_unit_sales = sum(UNIT_SALES)) %>%
ggplot(aes(x = MONTH, y = total_unit_sales)) +
geom_bar(stat = "identity") +
labs(title = "Total Unit Sales of 6 Month Sales by Month of the Year",
x = "MONTH",
y = "Total Unit Sales")
This product grouping as with the last will need to be built on many assuptions. There currently is no other products with CASSAVA in the market. There are 10 itmes that have been sold for 6 months at a time that are in the energy drink category. For this we will need to expand out our data set to hone in on which 6 months of the year would be best. In a graph of sales we do see a window of time that these matching products have sold the most, Feb - July.
Item Description: Peppy Gentle Drink Pink Woodsy .5L Multi Jug Caloric Segment: Regular Type: SSD Manufacturer: Swire-CC Brand: Peppy Package Type: .5L Multi Jug Flavor: ‘Pink Woodsy’ Swire plans to release this product in the Southern region for 13 weeks. What will the forecasted demand be, in weeks, for this product?
# Matching parameters for Q3
df %>%
filter(
sales_category == "13 Week Sales",
CALORIC_SEGMENT == 1,
CATEGORY == "SSD",
#BRAND == "PEPPY",
#str_detect(ITEM, "PINK WOODSY")
) %>%
group_by(ITEM) %>%
summarize(distinct_items = n_distinct(ITEM))
## # A tibble: 62 × 2
## ITEM distinct_items
## <chr> <int>
## 1 AZURE HORIZON GENTLE DRINK WILD PINK CUP 12 LIQUID SMALL 1
## 2 BARS GENTLE DRINK PINA JUG 67.6 LIQUID SMALL 1
## 3 BARS GENTLE DRINK PONCHE WOODSY JUG 67.6 LIQUID SMALL 1
## 4 CUPSHIELD'S TONIC WATER UNFLAVORED JUG 10 LIQUID SMALL 1
## 5 DESERT REFRESHMENT GENTLE DRINK SUNSET CASAVA BLAST CUP 12 … 1
## 6 ELF BUBBLES GENTLE DRINK MELLOW D MIXED-TROPPY JUG 16.9 LIQU… 1
## 7 ELF BUBBLES GENTLE DRINK RED SUNSET SUPER-JUICE DURIAN JU… 1
## 8 ELF BUBBLES GENTLE DRINK SUMMER MELLOW D MIXED-TROPPY SUPER-… 1
## 9 FANTASMIC GENTLE DRINK BERRY CUP 12 LIQUID SMALL X12 1
## 10 FANTASMIC GENTLE DRINK CASAVA JUG 12 LIQUID SMALL X4 1
## # ℹ 52 more rows
# There are 62 items that match time period caloric segment, category. flavor, packaging and brand combination do not exist
df %>%
filter(str_detect(ITEM, "PINK WOODSY")) %>%
summarize(distinct_items = n_distinct(ITEM))
## distinct_items
## 1 0
#0 items have Pink Woodsy Flavored items
#Distribution of matching items
df %>%
filter(sales_category == "13 Week Sales",
CALORIC_SEGMENT == 1,
CATEGORY == "SSD") %>%
group_by(ITEM, weeks_since_launch) %>%
summarize(total_unit_sales = sum(UNIT_SALES)) %>%
ggplot(aes(x = weeks_since_launch, y = total_unit_sales)) +
geom_bar(stat = "identity") +
labs(title = "Total Unit Sales by Week Since Launch",
x = "Week Since Launch",
y = "Total Unit Sales"
)
This product does have more compariable data since it is in the SSD category. We should be able to create a good data set once linked with the region data around a 13 week forecast. The distribution of the 13 week sales products in the SSD category follows the general distribution with large sales in week 0 and 1 very small sales in the middle and a bump in the final weeks.
rm(categories_count)
rm(sales_summary)
rm(df)
| State | Total Market Pop | Household Count | # of Market Segments | # of Zip Codes | # of Adult Males | # of Adult Females |
|---|---|---|---|---|---|---|
| WA | 5,626,119 | 2,896,229 | 33 | 477 | 2,779,456 | 2,846,661 |
| AZ | 5,411,596 | 2,670,174 | 56 | 352 | 2,637,239 | 2,774,348 |
| CO | 4,351,983 | 2,259,753 | 35 | 395 | 2,160,955 | 2,191,051 |
| OR | 2,934,725 | 1,507,859 | 23 | 281 | 1,434,336 | 1,500,381 |
| UT | 2,209,746 | 1,004,513 | 14 | 211 | 1,098,924 | 1,110,831 |
| ID | 1,385,341 | 699,690 | 12 | 221 | 684,024 | 701,329 |
| NM | 864,766 | 444,419 | 13 | 80 | 417,194 | 447,574 |
| CA | 846,163 | 357,260 | 28 | 61 | 417,686 | 428,482 |
| NV | 618,654 | 311,145 | 10 | 70 | 311,641 | 307,009 |
| WY | 162,927 | 90,222 | 4 | 23 | 82,158 | 80,771 |
| KS | 66,474 | 37,490 | 6 | 22 | 32,329 | 34,147 |
| NE | 59,441 | 33,767 | 2 | 29 | 29,097 | 30,345 |
| SD | 40,253 | 20,262 | 1 | 10 | 19,711 | 20,539 |
| Total | 24,578,188 | 12,332,783 | 200 | 2,232 | 12,104,750 | 12,473,468 |
#The data collected on household income is poorly distributed.
#Makes it appear that most people are wealthy, in the 150K+ group.
#We will consider combining household income groups into 50K segments to capture a more realistic distribution.
combined_plots #aggregated by Household Income (averages and counts)
combined_pie_charts
# The Lifestyle plot gives us an idea of how the population is spread amongst the various Zip codes that belong to all of the Swire (& Competitors) market areas. More than half of all consumers live in the suburban sprawl. As a reminder the total number of adults included in the market area was 24,578,188.
lifestyle_plot
# Typical Men - more in the earlier years, dying off by middle age faster than women.
age_group_plot